CUDA实现并行求和

CUDA中的网格跨步循环

参考 https://www.codeleading.com/article/6260306169/
cuda中的kernel我们一般会这样写

__global__ void parallel_run(int n){
    for(int i = threadIdx.x + blockIdx.x * blockDim.x;i < n;i += blockDim.x * gridDim.x){
        //doing something..
    }
}

这里是对于网格跨步循环的官方解释：
Notice that the stride of the loop is blockDim.x * gridDim.x which is the total number of threads in the grid. So if there are 1280 threads in the grid, thread 0 will compute elements 0, 1280, 2560, etc. This is why I call this a grid-stride loop. By using a loop with stride equal to the grid size, we ensure that all addressing within warps is unit-stride, so we get maximum memory coalescing, just as in the monolithic version.

也就是说如果一个网格内有1280个线程，那么thread_0会计算成员的0,1280,2560…等一系列的数组索引。也就是线程1会同时将for循环内的操作应用到这一系列索引，从而实现了并行计算。

TLS（Thread-local-Storage）

如果我们需要对一个数组进行求和，利用网格跨步循环可以实现并行计算。

__global__ void parallel_for(int n, int* sum, int *arr){
    for(int i = threadIdx.x + blockDim.x * blockIdx.x;
    i < n; i += gridDim.x * blockDim.x){
        //我们很自然的想到了
        &sum[0] += arr[i];
    }
}

但是实际的运行下来我们会发现结果并不等于数组的和，这是因为我们的这一条语句*sum += arr[i];会被编译器解释为一系列的操作1. 读取寄存器1的值 2. 读取寄存器2的值 3. 寄存器1+寄存器2 4. 写回到数组中。

在多线程环境下就会出现data race，所以我们需要保证原子化的操作。可以直接使用cudaruntime.h为我们提供的atomicAdd()。

1	`atomicAdd(sum, arr[i]);`

这样的输出结果就是正确的了，但是又引入的新的问题，每一个线程都进行原子化的add操作，也就是将循环串行化了，就放弃了GPU并行计算的能力。该怎样进一步优化呢？

这就是TLS技术了，我们可以让每一个线程都进行线程本地存储,最后再进行相加。

__global__ void parallel_for(int n, int* sum, int *arr){
    int local_sum = 0;
    for(int i = threadIdx.x + blockDim.x * blockIdx.x;
    i < n; i += gridDim.x * blockDim.x){
        //我们很自然的想到了
        local_sum += arr[i];
    }

    atomicAdd(&sum[0], local_sum);
}

这样就实现了一个数组的求和操作。

#CUDA

CUDA实现并行求和

http://example.com/2026/01/18/cuda实现数组并行求和/

作者

Soya

发布于

2026年1月18日

许可协议

Qwen&GPT2 模型架构上一篇

CUDA共享内存下一篇