On 01/18/2018 12:40 AM, Cesar Philippidis wrote:
In PR83920, I encountered a nvptx bug where live predicate variables
were clobbered before their value was broadcasted.
Hi,
I've managed to reproduce the problem based on the description in the PR.
Apparently, there
were problems in certain version of the CUDA driver where the JIT would
generate wrong code for shfl broadcasts.
Correct. And there's a work around committed for the JIT problem, which
you refer to in the next line (without introducing it first).
The attached patch teaches
nvptx_single not to apply that workaround if the predicate register is live.
> Tom, does this patch look sane to you?
The fact that the cond register is live at the start of the from bb does
not mean that the register can't be set inside the bb.
Furthermore, the live info does not make a distinction between
live-for-lane-0 and line-for-warp. So, if the condition reg is not set
in the bb, but set only for lane-0 in a previous bb, then we still need
to initialize lanes 1-31.
So, I don't think this is the way to address this bug.
I'm not sure if it defeats the
purpose of your original patch.
In test cases mentioned above, it does.
Regardless, the live predicate registers
shouldn't be clobbered before they are used.
There is a bug in the workaround, that's correct.
I think the way to address it is using a tmp .pred reg like so:
...
{
.reg .u32 %x;
mov.u32 %x,%tid.x;
setp.ne.u32 %rnotvzero,%x,0;
}
{
.reg .pred %rcond2;
setp.eq.u32 %rcond2, 1, 0; // workaround
@%rnotvzero bra Lskip;
...
setp.<op>.<type> %rcond,op1,op2; // could be here, could be earlier
mov.b1 %rcond2, %rcond; // used pseudo opcode mov.b1 for convenience
Lskip:
selp.u32 %rcondu32,1,0,%rcond2;
shfl.idx.b32 %rcondu32,%rcondu32,0,31;
setp.ne.u32 %rcond,%rcondu32,0;
}
...
Unfortunately, I cannot reproduce the runtime failure with gemm example
in the PR, so I didn't include it in the patch.
I'm managed to modify the test-case such that it reproduces the failure
with trunk (see PR). So, that test-case should be included.
Thanks,
- Tom
However, this patch does
fix the failure with da-1.c in og7.
This patch does not cause any
regressions.
Is it OK for trunk?
Thanks,
Cesar
nvptx-jit-relax.diff
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 55c7e3c..698c574 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3957,6 +3957,7 @@ bb_first_real_insn (basic_block bb)
static void
nvptx_single (unsigned mask, basic_block from, basic_block to)
{
+ bitmap live = DF_LIVE_IN (from);
rtx_insn *head = BB_HEAD (from);
rtx_insn *tail = BB_END (to);
unsigned skip_mask = mask;
@@ -4126,8 +4127,9 @@ nvptx_single (unsigned mask, basic_block from,
basic_block to)
There is nothing in the PTX spec to suggest that this is wrong, or
to explain why the extra initialization is needed. So, we classify
it as a JIT bug, and the extra initialization as workaround. */
- emit_insn_before (gen_movbi (pvar, const0_rtx),
- bb_first_real_insn (from));
+ if (!bitmap_bit_p (live, REGNO (pvar)))
+ emit_insn_before (gen_movbi (pvar, const0_rtx),
+ bb_first_real_insn (from));
#endif
emit_insn_before (nvptx_gen_vcast (pvar), tail);
}