[llvm-bugs] [Bug 131415] [cuclang 20.1.0] Using __reduce_max_sync in a cuda kernel fails with an illegal instruction was encountered

LLVM Bugs via llvm-bugs Sat, 15 Mar 2025 09:18:54 -0700

Issue	131415
Summary	[cuclang 20.1.0] Using __reduce_max_sync in a cuda kernel fails with an illegal instruction was encountered
Labels	new issue
Assignees
Reporter	AustinSchuh

    Using clang 20.1.0.  I've got a CUDA kernel which boils down to the following:


```
__global__ void testcode(const float* data, unsigned *max_value) {
    unsigned r = static_cast<unsigned>(data[threadIdx.x]);

    const unsigned mask = __ballot_sync(0xFFFFFFFF, true);

    unsigned mx = __reduce_max_sync(mask, r);
    atomicMax(max_value, mx);
}
```

When I run this with nvcc, it works, and with clang, I get an `an illegal instruction was encountered` message.

Compiler explorer confirms that they generate slightly different ptx.

nvcc
```
        ld.param.u64    %rd1, [testcode(float const*, unsigned int*)_param_0];
        ld.param.u64    %rd2, [testcode(float const*, unsigned int*)_param_1];
        cvta.to.global.u64      %rd3, %rd2;
        cvta.to.global.u64      %rd4, %rd1;
        mov.u32 %r1, %tid.x;
        mul.wide.u32    %rd5, %r1, 4;
        add.s64 %rd6, %rd4, %rd5;
        ld.global.f32   %f1, [%rd6];
 cvt.rzi.u32.f32         %r2, %f1;
        mov.pred        %p1, -1;
 mov.u32         %r3, -1;
        vote.sync.ballot.b32    %r4, %p1, %r3;
 redux.sync.max.u32 %r5, %r2, %r4;
        atom.global.max.u32     %r6, [%rd3], %r5;
        ret;
```

clang
```
        ld.param.u64    %rd1, [testcode(float const*, unsigned int*)_param_0];
        ld.param.u64 %rd2, [testcode(float const*, unsigned int*)_param_1];
 cvta.to.global.u64      %rd3, %rd2;
        cvta.to.global.u64      %rd4, %rd1;
        mov.u32         %r1, %tid.x;
        mul.wide.u32    %rd5, %r1, 4;
        add.s64         %rd6, %rd4, %rd5;
        ld.global.f32 %f1, [%rd6];
        cvt.rzi.u32.f32         %r2, %f1;
        mov.pred %p1, -1;
        vote.sync.ballot.b32    %r3, %p1, -1;
 redux.sync.max.u32 %r4, %r3, %r2;
        atom.global.max.u32     %r5, [%rd3], %r4;
        ret;
```

I hacked around and got it to work with:
```
__global__ void testcode2(const float* data, unsigned int* max_value) {
    unsigned int r = static_cast<unsigned int>(data[threadIdx.x]);
    unsigned int mask = __ballot_sync(0xFFFFFFFF, true);
    unsigned int mx;                                           
 asm volatile(                               
        " redux.sync.max.u32 %0, %2, %1;"
        : "=r"(mx) // Output operand (mask)     
        : "r"(mask), "r"(r)// No input operands          
 : "cc" // Clobbered registers and condition codes
    ); 
      
    atomicMax(max_value, mx); 
}      
```

The fix being to swap the mask and input on the redux.sync.

_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs

[llvm-bugs] [Bug 131415] [cuclang 20.1.0] Using __reduce_max_sync in a cuda kernel fails with an illegal instruction was encountered

Reply via email to