https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83920
--- Comment #8 from cesar at gcc dot gnu.org ---
I tweaked your proposed fix as follows:

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 55c7e3cbf90..24625cd303f 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4104,8 +4104,11 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 		       mov.u32 %x,%tid.x;
 		       setp.ne.u32 %rnotvzero,%x,0;
 		    }
+		    reg.pred %rcond2; // Scratch copy of the original rcond.
+		    mov.pred %rcond2, %rcond;

 		    @%rnotvzero bra Lskip;
+		    mov.pred %rcond, %rcond2
 		    setp.<op>.<type> %rcond,op1,op2;
 		    Lskip:
 		    selp.u32 %rcondu32,1,0,%rcond;
@@ -4126,8 +4129,11 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	     There is nothing in the PTX spec to suggest that this is wrong, or
 	     to explain why the extra initialization is needed.  So, we classify
 	     it as a JIT bug, and the extra initialization as workaround.  */
-	  emit_insn_before (gen_movbi (pvar, const0_rtx),
-			    bb_first_real_insn (from));
+	  rtx_insn *from_insn = bb_first_real_insn (from);
+	  rtx ptmp = gen_reg_rtx (GET_MODE (pvar));
+	  emit_insn_before (gen_rtx_SET (ptmp, pvar), from_insn);
+	  emit_insn_before (gen_movbi (pvar, const0_rtx), from_insn);
+	  emit_insn_before (gen_rtx_SET (pvar, ptmp), tail);
 #endif
 	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
 	}

This generates the following assembly code for gemm.f90:

$L34:
$L11:
	mov.pred	%r413, %r314;
	setp.eq.u32	%r314, 1, 0;
	@%r402	bra	$L33;
$L33:
	mov.pred	%r314, %r413;
	selp.u32	%r414, 1, 0, %r314;
	shfl.idx.b32	%r414, %r414, 0, 31;
	setp.ne.u32	%r314, %r414, 0;
	@!%r314	bra.uni	$L22;
	bra	$L3;
$L12:

I'm not sure what's going on here, because this patch causes illegal memory
access errors in lsdalton. Any thoughts?

Maybe a more involved workaround would be to leave %r314 alone, and use the
scratch %r413 register as the predicate. But then wouldn't that prevent the PRE
code hoisting optimization which moved the computation for %r314 outside of the
loop in the first place?

Is this original PTX JIT bug still present in the current Nvidia drivers? You mentioned that this problem first appeared in 381.22. I wonder if it has been resolved in 387.
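
For reference, the gemm.f90 listing above can be traced lane by lane with a small Python model. This is only a sketch of how the instructions read according to the PTX documentation, assuming a 32-lane warp and assuming that `shfl.idx.b32 ..., 0, 31` makes every lane read lane 0's value. The function name `run_listing` and the list-per-register representation are hypothetical, and the model cannot reproduce the JIT bug or the lsdalton failure; it only shows what the sequence should compute if the JIT honors the documented semantics.

```python
WARP_SIZE = 32  # assumed number of lanes in a warp


def run_listing(r314_in):
    """Trace the listed PTX sequence for one warp.

    r314_in holds the incoming value of %r314, one bool per lane.
    Returns the value of %r314 in each lane after the final setp.ne.
    """
    assert len(r314_in) == WARP_SIZE

    # mov.pred %r413, %r314;   save the incoming predicate
    r413 = list(r314_in)

    # setp.eq.u32 %r314, 1, 0;   1 == 0 is false, so this clears %r314
    r314 = [False] * WARP_SIZE

    # @%r402 bra $L33;   the target is the very next instruction, so the
    # skipped region is empty and every lane continues at $L33

    # $L33: mov.pred %r314, %r413;   restore, executed by every lane
    r314 = list(r413)

    # selp.u32 %r414, 1, 0, %r314;   predicate -> 0/1 integer
    r414 = [1 if p else 0 for p in r314]

    # shfl.idx.b32 %r414, %r414, 0, 31;   every lane takes lane 0's value
    r414 = [r414[0]] * WARP_SIZE

    # setp.ne.u32 %r314, %r414, 0;   integer -> predicate
    return [v != 0 for v in r414]


if __name__ == "__main__":
    lane0_true = [True] + [False] * (WARP_SIZE - 1)
    print(all(run_listing(lane0_true)))
```

Under this model the clearing `setp.eq.u32` is immediately overwritten by the restore at `$L33`, and every lane ends up with whatever lane 0 held in %r314 on entry.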