I wonder whether it is possible for TVM to support CUDA warp-level sync operations. For
example, if I want to use shuffle intrinsics, what should I do? If that is not
possible, then I have to use shared memory, but TVM will then generate
`__syncthreads()`, which is overkill. If I load and consume the shared memory only
within a warp, i.e. no memory is shared across warps, I don't need `__syncthreads()`, right?
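For context, this is roughly the kind of warp-level code I would like to end up with (a minimal hand-written sketch, not anything TVM emits today; the kernel name and launch configuration are my own):
```C
// Hypothetical hand-written kernel: each warp reverses its 32 consecutive
// elements using __shfl_sync, so no shared memory and no barrier is needed.
// Assumes blockDim.x is a multiple of 32 and
// gridDim.x * blockDim.x == number of elements.
extern "C" __global__ void reverse_in_warp(const int* __restrict__ in,
                                           int* __restrict__ out) {
  int lane = threadIdx.x & 31;                       // lane id within the warp
  int idx  = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
  int val  = in[idx];
  // Read the value held by the mirrored lane of the same warp.
  out[idx] = __shfl_sync(0xffffffffu, val, 31 - lane);
}
```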
Let's look at the following naive example:
```python
import tvm
from tvm.te import hybrid


@hybrid.script
def demo_sync(indices):
    out = output_tensor((indices.shape[0],), 'int32')
    sm_i = allocate((128,), 'int32', 'shared')
    for b in bind('blockIdx.x', indices.shape[0] // 128):
        for y in range(4):
            for x in bind('threadIdx.x', 32):
                sm_i[y * 4 + x] = indices[b * 128 + y * 4 + x]
                out[y * 4 + x] = sm_i[y * 4 + 31 - x]
    return out


indices = tvm.te.placeholder((1024,), 'int32', 'indices')
out = demo_sync(indices)
sched = tvm.te.create_schedule(out.op)
f = tvm.build(sched, [indices, out], target='cuda')
print(f.imported_modules[0].get_source())
```
It generates the following CUDA code:
```C
extern "C" __global__ void default_function_kernel0(int* __restrict__ indices, int* __restrict__ demo_sync) {
  __shared__ int sm_i[128];
  for (int y = 0; y < 4; ++y) {
    __syncthreads();
    sm_i[(((y * 4) + ((int)threadIdx.x)))] = indices[((((((int)blockIdx.x) * 128) + (y * 4)) + ((int)threadIdx.x)))];
    __syncthreads();
    demo_sync[(((y * 4) + ((int)threadIdx.x)))] = sm_i[((((y * 4) + 31) - ((int)threadIdx.x)))];
  }
}
```
I think the two `__syncthreads()` calls are not necessary. Is it possible to fix this,
either by using shuffle intrinsics instead, by not generating the sync operations at
all, or by using `__syncwarp()`?
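For the `__syncwarp()` option, here is a hand-edited sketch of the generated kernel (my own edit, not TVM output): since each block here contains only a single warp, a warp-level barrier should be enough to order the shared-memory store before the reversed load.
```C
// Hand-edited variant of the generated kernel (hypothetical): __syncthreads()
// replaced by __syncwarp(). The block has only 32 threads (one warp), so a
// warp-level barrier should suffice to make the store to sm_i visible before
// the reversed read; indexing is kept exactly as in the generated code.
extern "C" __global__ void demo_sync_warp(int* __restrict__ indices, int* __restrict__ demo_sync) {
  __shared__ int sm_i[128];
  for (int y = 0; y < 4; ++y) {
    sm_i[(((y * 4) + ((int)threadIdx.x)))] = indices[((((((int)blockIdx.x) * 128) + (y * 4)) + ((int)threadIdx.x)))];
    __syncwarp();  // order the store before the reversed read within this warp
    demo_sync[(((y * 4) + ((int)threadIdx.x)))] = sm_i[((((y * 4) + 31) - ((int)threadIdx.x)))];
    __syncwarp();  // keep the next iteration's store from racing ahead of this read
  }
}
```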