| Issue     | 98908 |
| --------- | ----- |
| Summary   | [NVPTX] Optimization causes threads to erroneously enter if-block |
| Labels    | new issue |
| Assignees | |
| Reporter  | psalz |
When implementing an optimized atomic counter using warp-level primitives (as described, e.g., [here](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/#opportunistic_warp-level_programming)), code compiled with Clang/LLVM erroneously enters a branch with every thread in a warp, even though only a subset of threads should enter it.
I've reproduced the bug with Clang 14, Clang 17, and a recent manual build (59e56ee). The bug appears only when optimizations are enabled (`-O1` or higher). The program works correctly when compiled with `nvcc`.
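An invocation along these lines triggers it (I'm targeting `sm_70` here, which `__match_any_sync` requires; the exact architecture flag will depend on your GPU):
```
clang++ -O1 --cuda-gpu-arch=sm_70 repro.cu -o repro -L/usr/local/cuda/lib64 -lcudart
```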
Full reproducer:
```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>
#define CUDA_CHECK(fn, ...) \
  { \
    cudaError_t status = fn(__VA_ARGS__); \
    if(status != cudaSuccess) { \
      fprintf(stderr, "CUDA Error in call %s on line %d: %s\n", #fn, __LINE__, cudaGetErrorString(status)); \
      abort(); \
    } \
  }
__global__ static void kernel(const int* input, uint32_t* counters) {
  const uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
  const int data = input[idx]; // pairs of threads read the same value (0, 0, 1, 1, ...)
  uint32_t* counter = &counters[data];
  auto active_group = __activemask();
  auto mask = __match_any_sync(active_group, data); // group lanes holding the same value
  auto leader = __ffs(mask) - 1; // lowest lane in each group is the leader
  /* volatile */ uint32_t res; // making res volatile fixes the problem
  auto laneid = threadIdx.x % 32;
  if(laneid == leader) { // this branch is entered by every thread
    res = atomicAdd(counter, __popc(mask));
    printf("i am %d (tid=%d), leader is %u\n", laneid, threadIdx.x, leader);
  }
  res = __shfl_sync(mask, res, leader); // removing the warp shuffle altogether also fixes it
}
int main() {
  constexpr size_t buffer_size = 32;
  constexpr size_t input_size = 32;
  std::vector<int> input_data(input_size);
  int* input_d;
  CUDA_CHECK(cudaMalloc, &input_d, input_data.size() * sizeof(int));
  const auto max_values_per_slot = (input_data.size() + buffer_size - 1) / buffer_size; // unused in this reduced reproducer
  for(size_t i = 0; i < input_data.size(); ++i) {
    input_data[i] = i / 2; // 0, 0, 1, 1, 2, 2, ...
  }
  uint32_t* counters_d;
  CUDA_CHECK(cudaMalloc, &counters_d, buffer_size * sizeof(uint32_t));
  CUDA_CHECK(cudaMemcpy, input_d, input_data.data(), input_data.size() * sizeof(int), cudaMemcpyHostToDevice);
  const unsigned block_size = 32;
  const unsigned grid_size = (input_size + block_size - 1) / block_size;
  kernel<<<grid_size, block_size>>>(input_d, counters_d);
  CUDA_CHECK(cudaFree, input_d);
  CUDA_CHECK(cudaFree, counters_d);
  return 0;
}
```
The program initializes a buffer of size 32 with 16 consecutive pairs of values (`0, 0, 1, 1, 2, 2, ...`). It then uses `__match_any_sync` to group threads that hold the same value (i.e., threads 0 and 1, threads 2 and 3, and so on). Only the first thread in each group should then enter the block that does the `atomicAdd`. However, this is the output:
```
i am 0 (tid=0), leader is 0
i am 1 (tid=1), leader is 0
i am 2 (tid=2), leader is 2
i am 3 (tid=3), leader is 2
i am 4 (tid=4), leader is 4
i am 5 (tid=5), leader is 4
...
```
What it should look like:
```
i am 0 (tid=0), leader is 0
i am 2 (tid=2), leader is 2
i am 4 (tid=4), leader is 4
...
```
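To spell out why those are the expected leaders, here is the mask arithmetic for the first two groups (a worked illustration, not part of the reproducer):
```cuda
// Lanes 0 and 1 both hold data == 0:
//   mask   = __match_any_sync(active_group, data)  ->  0b...0011 (bits 0 and 1 set)
//   leader = __ffs(0b0011) - 1                     ->  0
//   __popc(mask) == 2, so lane 0 alone should add 2 to its counter
// Lanes 2 and 3 both hold data == 1:
//   mask = 0b...1100, leader = 2, and so on for each pair.
```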
Since either marking `res` volatile or removing the call to `__shfl_sync` altogether fixes the issue, I suspect that LLVM decides it is cheaper to compute `res` in each thread instead of calling the warp collective, somehow disregarding the fact that neither `atomicAdd` nor `printf` is side-effect free.
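If that is the case, the optimized kernel would effectively behave as if the guard had been dropped, roughly like this (a hand-written illustration of the suspected transformation, not actual compiler output):
```cuda
// Suspected effective behavior after optimization (illustration only):
// every lane executes the "leader" work unconditionally ...
res = atomicAdd(counter, __popc(mask));
printf("i am %d (tid=%d), leader is %u\n", laneid, threadIdx.x, leader);
// ... which would make the subsequent shuffle redundant, since every
// lane already holds its own res.
```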