Issue 98908
Summary [NVPTX] Optimization causes threads to erroneously enter if-block
Labels new issue
Assignees
Reporter psalz
    When implementing an optimized atomic counter using warp-level primitives (as described, e.g., [here](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/#opportunistic_warp-level_programming)), code compiled with LLVM erroneously enters a branch with every thread in a warp, even though only a subset of threads should enter it.

I've reproduced the bug with Clang 14, Clang 17, and a recent manual build (59e56ee). The bug appears only when optimizations are enabled (`-O1` or higher); the program works correctly when compiled with `nvcc`.

Full reproducer:

```cuda
#include <cuda_runtime.h>

#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

#define CUDA_CHECK(fn, ...) \
	{ \
		cudaError_t status = fn(__VA_ARGS__); \
		if(status != cudaSuccess) { \
			fprintf(stderr, "CUDA Error in call %s on line %d: %s\n", #fn, __LINE__, cudaGetErrorString(status)); \
			abort(); \
		} \
	}

__global__ static void kernel(const int* input, uint32_t* counters) {
	const uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
	const int data = input[idx];
	uint32_t* counter = &counters[data];

	auto active_group = __activemask();
	auto mask = __match_any_sync(active_group, data);

	auto leader = __ffs(mask) - 1;
	/* volatile */ uint32_t res; // making res volatile fixes the problem
	auto laneid = threadIdx.x % 32;
	if(laneid == leader) { // this branch is entered by every thread
		res = atomicAdd(counter, __popc(mask));
		printf("i am %d (tid=%d), leader is %u\n", laneid, threadIdx.x, leader);
	}
	res = __shfl_sync(mask, res, leader); // removing the warp shuffle altogether also fixes it
}

int main() {
	constexpr size_t buffer_size = 32;
	constexpr size_t input_size = 32;
	std::vector<int> input_data(input_size);
	int* input_d;
	CUDA_CHECK(cudaMalloc, &input_d, input_data.size() * sizeof(int));
	const auto max_values_per_slot = (input_data.size() + buffer_size - 1) / buffer_size;
	for(size_t i = 0; i < input_data.size(); ++i) {
		input_data[i] = i / 2;
	}

	uint32_t* counters_d;
	CUDA_CHECK(cudaMalloc, &counters_d, buffer_size * sizeof(uint32_t));
	CUDA_CHECK(cudaMemcpy, input_d, input_data.data(), input_data.size() * sizeof(int), cudaMemcpyHostToDevice);

	const unsigned block_size = 32;
	const unsigned grid_size = (input_size + block_size - 1) / block_size;
	kernel<<<grid_size, block_size>>>(input_d, counters_d);

	CUDA_CHECK(cudaFree, input_d);
	CUDA_CHECK(cudaFree, counters_d);

	return 0;
}
```

The program initializes a buffer of size 32 with 16 consecutive pairs of values (`0, 0, 1, 1, 2, 2, ...`). It then uses `__match_any_sync` to group threads that hold the same value (i.e., threads 0 and 1, threads 2 and 3, and so on). Only the first thread in each group should then enter the block that performs the `atomicAdd`. However, this is the output:

```
i am 0 (tid=0), leader is 0
i am 1 (tid=1), leader is 0
i am 2 (tid=2), leader is 2
i am 3 (tid=3), leader is 2
i am 4 (tid=4), leader is 4
i am 5 (tid=5), leader is 4
...
```

What it should look like:

```
i am 0 (tid=0), leader is 0
i am 2 (tid=2), leader is 2
i am 4 (tid=4), leader is 4
...
```

Since either marking `res` as `volatile` or removing the call to `__shfl_sync` altogether fixes the issue, I suspect that LLVM decides it is cheaper to compute `res` in every thread instead of performing the warp shuffle, thereby hoisting the body of the branch into all lanes and disregarding the fact that neither `atomicAdd` nor `printf` is side-effect free.
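For reference, this is how the `volatile` workaround mentioned above looks in the kernel (a sketch of the change only, not a fix for the underlying miscompile):

```cuda
	/* Marking res volatile appears to stop the compiler from duplicating
	   the branch body (atomicAdd + printf) into every lane. */
	volatile uint32_t res;
	const auto laneid = threadIdx.x % 32;
	if(laneid == leader) {
		res = atomicAdd(counter, __popc(mask));
	}
	res = __shfl_sync(mask, res, leader);
```

With this change, only the leader lanes print, as in the expected output shown above.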
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
