On Mon, 2016-07-11 at 12:19 -0700, Francisco Jerez wrote: > Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes: > > > > > From: Iago Toral Quiroga <ito...@igalia.com> > > > > In fp64 we can produce code like this: > > > > mov(16) vgrf2<2>:UD, vgrf3<2>:UD > > > > That our simd lowering pass would typically split in instructions > > with a > > width of 8, writing to two consecutive registers each. > > Unfortunately, gen7 > > hardware has a bug affecting execution masking and as a result, the > > second GRF register write won't work properly. Curro verified this: > > > > "The problem is that pre-Gen8 EUs are hardwired to use the > > QtrCtrl+1 > > (where QtrCtrl is the 8-bit quarter of the execution mask signals > > specified in the instruction control fields) for the second > > compressed half of any single-precision instruction (for > > double-precision instructions it's hardwired to use NibCtrl+1, > > at least on HSW), which means that the EU will apply the wrong > > execution controls for the second sequential GRF write if the > > number > > of channels per GRF is not exactly eight in single-precision mode > > (or > > four in double-float mode)." > > > > In practice, this means that we cannot write more than one > > consecutive GRF in a single instruction if the number of channels > > per GRF is not exactly eight in single-precision mode (or four > > in double-float mode). > > > > This patch makes our SIMD lowering pass split this kind of > > instructions > > so that the split versions only write to a single register. In the > > example above this means that we split the write in 4 instructions, > > each > > one writing 4 UD elements (width = 4) to a single register. > > > > v2 (Curro): > > - Make explicit that the thing about hardwiring NibCtrl+1 for the > > second > > compressed half is known to happen in Haswell and the issue with > > IVB > > might not be exactly the same. > > - Assign max_width instead of returning early so that we can > > handle > > multiple restrictions affecting to the same instruction. > > - Avoid division by 0 if the instruction does not write any > > registers. > > - Ignore instructions what have WE_all set. > > - Use the instruction execution type size instead of the dst type > > size. > > --- > > src/mesa/drivers/dri/i965/brw_fs.cpp | 28 > > ++++++++++++++++++++++++++++ > > 1 file changed, 28 insertions(+) > > > > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp > > b/src/mesa/drivers/dri/i965/brw_fs.cpp > > index 2f473cc..4d57412 100644 > > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp > > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp > > @@ -4691,6 +4691,34 @@ get_fpu_lowered_simd_width(const struct > > brw_device_info *devinfo, > > */ > > unsigned reg_count = inst->regs_written; > > > You've put this right in the middle of one of my previous workarounds > ;), can you move it down a bit more next to the "According to the IVB > PRMs" line, or close to the end of the function?
Oh, sure. > > > > + /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where > > QtrCtrl is > > + * the 8-bit quarter of the execution mask signals specified in > > the > > + * instruction control fields) for the second compressed half > > of any > > + * single-precision instruction (for double-precision > > instructions > > + * it's hardwired to use NibCtrl+1, at least on HSW), which > > means that > > + * the EU will apply the wrong execution controls for the > > second > > + * sequential GRF write if the number of channels per GRF is > > not exactly > > + * eight in single-precision mode (or four in double-float > > mode). > > + * > > + * In this situation we calculate the maximum size of the split > > + * instructions so they only ever write to a single register. > > + */ > > + if (devinfo->gen < 8 && inst->regs_written > 1 && > > + !inst->force_writemask_all) { > > + unsigned channels_per_grf = inst->exec_size / inst- > > >regs_written; > Could be declared const. Ok. > > > > + unsigned exec_type_size = 0; > > + for (int i = 0; i < inst->sources; i++) { > > + if (inst->src[i].file == BAD_FILE) > > + break; > It wouldn't be right to break early if the instruction has any valid > sources after a non-present one. This should probably be: I thought that couldn't happen. Will fix it. > > > > if (inst->src[i].file != BAD_FILE) > > exec_type_size = MAX2(exec_type_size, type_sz(inst- > > >src[i].type)); > instead. > > > > > + exec_type_size = MAX2(exec_type_size, type_sz(inst- > > >src[i].type)); > > + } > > + assert(exec_type_size); > > + > > + if (channels_per_grf != REG_SIZE / exec_type_size) { > I think you really need to use (exec_type_size == 8 ? 4 : 8) instead > of > the RHS of this expression. The hardware shifts exactly 8 channels > per > compressed half of the instruction regardless of the execution type, > so > this formula would give you an incorrect answer for exec_type_size < > 4. Ok, so it really has two strict working modes. Will fix this too. Thanks! > > > > + max_width = MIN2(max_width, channels_per_grf); > > + } > Redundant braces. > > > > > + } > > + > > for (unsigned i = 0; i < inst->sources; i++) > > reg_count = MAX2(reg_count, (unsigned)inst->regs_read(i)); > > > > _______________________________________________ > > mesa-dev mailing list > > mesa-dev@lists.freedesktop.org > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev