https://llvm.org/bugs/show_bug.cgi?id=27738
Bug ID: 27738
Summary: ptxas miscompile with divergent branches and convergent operations
Product: libraries
Version: trunk
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: P
Component: Backend: PTX
Assignee: unassignedb...@nondot.org
Reporter: justin.le...@gmail.com
CC: llvm-bugs@lists.llvm.org
Classification: Unclassified

We've discovered an apparent miscompile in ptxas with divergent branches and convergent operations. At the moment we do not have a workaround in LLVM, and I'm not sure one is possible in general. I'm filing this bug so that we have something to point people to if and when they hit this problem. As far as I can tell, this issue is not specific to LLVM; I think it should affect any compiler which emits ptx, including nvcc.

== Background ==

When targeting an Nvidia GPU, LLVM compiles IR to ptx, a high-level, device-independent assembly language. clang then passes the generated ptx to ptxas, a closed-source optimizing assembler which transforms the ptx to SASS, the device-specific machine code. The SASS ISA is not public, but a disassembler ships with the CUDA toolkit, so we can (more or less) understand the machine code.

Nvidia GPUs execute threads in groups of 32, called "warps". All threads in a warp execute in lockstep (so, if you will, the hardware is a 32-wide SIMD unit, and a warp is basically a hardware thread). When a warp encounters a conditional branch, some threads in the warp may take one path while others take the other. When this occurs, the warp is said to "diverge".

When the hardware executes a divergent branch, it chooses a direction to take, pushes the current IP onto a hardware stack, and continues executing, with the SIMD lanes (i.e., "threads") which didn't take this direction disabled. At some point, the hardware encounters a special instruction which tells it to reconverge. At this point it pops the hardware stack and executes the other side of the branch. When the second side reaches the special instruction, the warp reconverges: we reset the SIMD mask and continue running with all lanes enabled. (If you're looking for a detailed explanation of this mechanism, the best reference I've found is this paper, by Bialas and Strzelecki: http://arxiv.org/abs/1504.01650.)

Warp divergence can have large performance implications, but for the most part programmers (and compiler developers) can ignore divergence for the purposes of correctness. However, some GPU instructions behave observably differently depending on whether the warp is converged or diverged when they execute. It's therefore important that, if the programmer expects a particular convergent instruction to execute with the warp converged, the compiler maintain that invariant, and not transform the code so that the instruction ends up executing with the warp diverged. It's this invariant that ptxas appears to be breaking.
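To make the "observable difference" point concrete, here is a small illustration of my own (it is not part of the original report): __ballot is one such convergent operation. It returns a bitmask of the warp's active lanes that satisfy a predicate, so its result depends directly on the warp's divergence state at the point where it executes.

  // Illustrative sketch only (pre-Volta __ballot semantics; later CUDA
  // versions deprecate __ballot in favor of __ballot_sync). The kernel
  // name is made up.
  __global__ void ballot_demo(unsigned *out) {
    unsigned mask = __ballot(1);  // converged: all 32 lanes active -> 0xffffffff
    if (threadIdx.x % 32 < 16)
      mask = __ballot(1);         // diverged: only lanes 0-15 active -> 0x0000ffff
    out[threadIdx.x] = mask;      // lanes 0-15 and 16-31 record different masks
  }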
== Details ==

Consider the following CUDA code.

  int *p = ...
  if (tid == 0)
    *p = 42;
  __syncthreads();
  if (tid == 32 && *p != 42)
    asm("trap;");

The intent here is that thread 0 in warp 0 sets *p to 42, then all warps wait at the barrier, and then thread 0 in warp 1 checks that *p == 42. I'll use this example below, but I observe the same behavior if we do an intra-, rather than inter-, warp check: i.e., s/tid == 32/tid == 1/.
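For reference, here is one way to wrap the snippet into a self-contained kernel. The kernel name, parameter, and launch configuration are my own assumptions, not part of the original reproducer; and, as discussed below, a testcase this simple does not actually trigger the miscompile on its own.

  // Hypothetical standalone version of the snippet. Launch with one
  // block of at least 64 threads (two warps), with p pointing at
  // zero-initialized global memory, e.g.:
  //
  //   int *p;
  //   cudaMalloc(&p, sizeof(int));
  //   cudaMemset(p, 0, sizeof(int));
  //   repro<<<1, 64>>>(p);
  //
  __global__ void repro(int *p) {
    int tid = threadIdx.x;
    if (tid == 0)
      *p = 42;                  // thread 0 of warp 0 writes
    __syncthreads();            // all warps wait at the barrier
    if (tid == 32 && *p != 42)  // thread 0 of warp 1 checks
      asm("trap;");
  }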
Clang generates the following ptx for this snippet:

  ld.u32 %r9, [%rd2];        // Load tid.
  setp.eq.s32 %p5, %r9, 0;   // if tid == 0 goto LBB12_4, else goto LBB12_5.
  @%p5 bra LBB12_4;
  bra.uni LBB12_5;
  LBB12_4:
  mov.u32 %r6, 42;
  st.u32 [%rd3], %r6;
  LBB12_5:
  bar.sync 0;                // __syncthreads()
  setp.ne.s32 %p6, %r9, 32;  // if tid != 32 goto LBB12_11, else proceed.
  @%p6 bra LBB12_11;
  ld.u32 %r7, [%rd3];
  setp.eq.s32 %p7, %r7, 42;  // if p[0] == 42 goto LBB12_11, else proceed.
  @%p7 bra LBB12_11;
  trap;
  LBB12_11:                  // proceed normally

As far as I can tell, there's nothing wrong with this; it's pretty similar to what nvcc outputs. When compiled in the appropriate context, this particular code traps with ptxas -O1, but does not trap at -O2.

Here's the SASS at ptxas -O1. It's pretty much a literal translation of the ptx.

  ISETP.NE.AND P0, PT, R0, RZ, PT;
  /*1f28*/ @P0 BRA 0x1f50;            // branch if tid != 0
  MOV32I R0, 0x2a;
  ST.E [R4], R0;                      // p[0] = 42
  /*1f50*/ BAR.SYNC 0x0;              // __syncthreads()
  ISETP.NE.AND P0, PT, R0, 0x20, PT;
  @P0 BRA 0x1fd8;                     // branch if tid != 32
  LD.E R0, [R4];
  ISETP.EQ.AND P0, PT, R0, 0x2a, PT;
  @P0 BRA 0x1fd8;                     // branch if p[0] == 42
  BPT.TRAP 0x1;
  /*1fd8*/ [proceed normally]

At ptxas -O2, the only interesting difference I see is that the two instructions in the tid == 0 block are predicated, instead of being behind a branch.

  ISETP.NE.AND P0, PT, R4, RZ, PT;
  @!P0 MOV32I R5, 0x2a;               // if tid == 0, R5 = 42
  @!P0 ST.E [R10], R5;                // if tid == 0, *p = R5
  BAR.SYNC 0x0;                       // __syncthreads()
  [same as above]

Again, the O1 code traps, and the O2 code does not.

My theory, based on this behavior and on what I can gather about how the branch synchronization stack works, is that when we hit the conditional branch in the O1 code at 0x1f28, the warp diverges, we take the branch, and we keep running until we hit a .S instruction much later in our execution flow, which re-synchronizes the warp. In particular, we execute our bar.sync **in a divergent state**, which allows other warps to continue and read *p, which hasn't yet been set to 42. (Recall that bar.sync operates on warps, not threads.)

The behavior is similar when we change the code to do an intra-warp check -- s/tid == 32/tid == 1/. Without a .S instruction before the bar.sync, the warp just keeps running in a divergent fashion.

I have tested with ptxas 7.0, 7.5, and 8.0; I get the same behavior (trap vs. no trap) with all of them, although I've only checked the SASS from ptxas 7.0. I've only tested on my Tesla K40c.

== Discussion ==

Although the example above shows a difference in behavior with ptxas -O1 vs. -O2, it is possible to concoct examples that exhibit the bad behavior at ptxas -O2. In fact, one of the thrust [0] testcases fails at O2 with clang, apparently due to this bug (thrust uses the "if thread 0 writes some state, then syncthreads" idiom in many places; a sketch of the idiom follows below).
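The idiom in question looks roughly like this (a minimal sketch with made-up names, not code taken from thrust):

  // Thread 0 publishes some per-block state; every thread reads it
  // after the barrier. Correctness depends on each warp having
  // reconverged by the time it executes bar.sync -- exactly the
  // invariant ptxas appears to break.
  __global__ void broadcast(int *out, int value) {
    __shared__ int s;
    if (threadIdx.x == 0)   // divergent branch within warp 0
      s = value;            // thread 0 writes the shared state
    __syncthreads();        // barrier; warps are assumed to arrive converged
    out[threadIdx.x] = s;   // every thread reads what thread 0 wrote
  }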
I unfortunately haven't been able to come up with a good minimized testcase for this issue. With a simple testcase based only on the code above, ptxas outputs the appropriate .S instructions to resynchronize the warp. You need something more, but without the source code to ptxas, it's very hard to say what. It's possible ptxas is trying to insert the .S instruction in the right place, but we're able to confuse it *just enough* that it does the wrong thing. But the net result is that a convergent instruction placed after a divergent branch does not necessarily do the right thing.

As far as I can tell, nvcc doesn't do anything special to avoid this issue. I've spoken with Justin Holewinski from Nvidia, and he's said that we've rediscovered a longstanding, known issue in ptxas, and that he's not aware of a workaround.

[0] https://github.com/thrust/thrust