Issue 163904
Summary MLIR NVGPU / NVVM TMA operations (cp.async.bulk.tensor.*) don't support signed integer coordinates
Labels mlir
Assignees
Reporter jwake
I'm working on some MLIR codegen that occasionally needs to generate negative coordinates for a TMA multidimensional global -> shared memory copy, and I've noticed that the MLIR NVGPU dialect, the NVVM dialect, and the underlying LLVM NVPTX intrinsics for the TMA bulk copy instructions all require unsigned integers for the coordinates. This contradicts the PTX documentation, which states that the coordinates are all 32-bit signed ints (with the shared -> global direction additionally requiring that no component is negative).

It looks like `nvcc` just treats the values as unsigned ints in the PTX, eg.:
```cuda
#include <cuda/barrier>
namespace cde = cuda::device::experimental;  // provides cp_async_bulk_tensor_*

__global__ void kernel(const __grid_constant__ CUtensorMap tensor_map, int x) {
    int y = -123;
    ...
    cde::cp_async_bulk_tensor_2d_global_to_shared(&smem_buffer, &tensor_map, x, y, bar);
```
becomes
```ptx
.visible .entry kernel(CUtensorMap_st, int)(
	.param .align 64 .b8 kernel(CUtensorMap_st, int)_param_0[128],
	.param .u32 kernel(CUtensorMap_st, int)_param_1
)
...
	ld.param.u32 	%r4, [kernel(CUtensorMap_st, int)_param_1];
...
	mov.u32 	%r11, -123;
...
	cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes [%r10], [%rd8, {%r4, %r11}], [%r13];
```
so for now I'll just generate a similar bitcast in my MLIR lowering (sketched below).
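For illustration, here's a minimal sketch of that workaround at the NVVM level, assuming the op syntax used in recent upstream MLIR tests; the function and SSA value names are placeholders:

```mlir
// Sketch (assumed op syntax): builtin i32 is signless, so a negative
// coordinate's two's-complement bit pattern flows through to the PTX
// operand unchanged, mirroring the nvcc output above.
llvm.func @tma_load_2d(%desc: !llvm.ptr, %dst: !llvm.ptr<3>,
                       %bar: !llvm.ptr<3>, %x: i32) {
  %y = llvm.mlir.constant(-123 : i32) : i32
  nvvm.cp.async.bulk.tensor.shared.cluster.global %dst, %desc, %bar, box[%x, %y] : !llvm.ptr<3>, !llvm.ptr
  llvm.return
}
```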

It seems like, at the MLIR level at least, `nvgpu.tma.async.load`/`nvvm.cp.async.bulk.tensor.shared.cluster.global` should accept `si32`, `i32`, and `index` coordinates, while the corresponding store (shared -> global) ops stay unsigned, unless I'm missing something?
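To make the ask concrete, a hypothetical sketch of what that would permit at the NVGPU level (types elided, names made up, not verified against any particular MLIR revision):

```mlir
// Hypothetical: a negative coordinate on the global -> shared load; per the
// report, today's stack treats the value as unsigned. Descriptor and
// mbarrier types are elided ("...") for brevity, so this won't parse as-is.
%c0 = arith.constant 0 : index
%x = arith.constant 64 : index
%y = arith.constant -123 : index
nvgpu.tma.async.load %desc[%x, %y], %mbar[%c0] to %smem
    : !nvgpu.tensormap.descriptor<...>, !nvgpu.mbarrier.group<...> -> memref<128x64xf16, 3>
```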
