Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes: > From: Iago Toral Quiroga <ito...@igalia.com> > > In fp64 we can produce code like this: > > mov(16) vgrf2<2>:UD, vgrf3<2>:UD > > That our simd lowering pass would typically split in instructions with a > width of 8, writing to two consecutive registers each. Unfortunately, gen7 > hardware has a bug affecting execution masking and as a result, the > second GRF register write won't work properly. Curro verified this: > > "The problem is that pre-Gen8 EUs are hardwired to use the QtrCtrl+1 > (where QtrCtrl is the 8-bit quarter of the execution mask signals > specified in the instruction control fields) for the second > compressed half of any single-precision instruction (for > double-precision instructions it's hardwired to use NibCtrl+1, > at least on HSW), which means that the EU will apply the wrong > execution controls for the second sequential GRF write if the number > of channels per GRF is not exactly eight in single-precision mode (or > four in double-float mode)." > > In practice, this means that we cannot write more than one > consecutive GRF in a single instruction if the number of channels > per GRF is not exactly eight in single-precision mode (or four > in double-float mode). > > This patch makes our SIMD lowering pass split this kind of instructions > so that the split versions only write to a single register. In the > example above this means that we split the write in 4 instructions, each > one writing 4 UD elements (width = 4) to a single register. > > v2 (Curro): > - Make explicit that the thing about hardwiring NibCtrl+1 for the second > compressed half is known to happen in Haswell and the issue with IVB > might not be exactly the same. > - Assign max_width instead of returning early so that we can handle > multiple restrictions affecting to the same instruction. > - Avoid division by 0 if the instruction does not write any registers. > - Ignore instructions what have WE_all set. > - Use the instruction execution type size instead of the dst type size. > --- > src/mesa/drivers/dri/i965/brw_fs.cpp | 28 ++++++++++++++++++++++++++++ > 1 file changed, 28 insertions(+) > > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp > b/src/mesa/drivers/dri/i965/brw_fs.cpp > index 2f473cc..4d57412 100644 > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp > @@ -4691,6 +4691,34 @@ get_fpu_lowered_simd_width(const struct > brw_device_info *devinfo, > */ > unsigned reg_count = inst->regs_written; >
You've put this right in the middle of one of my previous workarounds ;), can you move it down a bit more next to the "According to the IVB PRMs" line, or close to the end of the function? > + /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where QtrCtrl is > + * the 8-bit quarter of the execution mask signals specified in the > + * instruction control fields) for the second compressed half of any > + * single-precision instruction (for double-precision instructions > + * it's hardwired to use NibCtrl+1, at least on HSW), which means that > + * the EU will apply the wrong execution controls for the second > + * sequential GRF write if the number of channels per GRF is not exactly > + * eight in single-precision mode (or four in double-float mode). > + * > + * In this situation we calculate the maximum size of the split > + * instructions so they only ever write to a single register. > + */ > + if (devinfo->gen < 8 && inst->regs_written > 1 && > + !inst->force_writemask_all) { > + unsigned channels_per_grf = inst->exec_size / inst->regs_written; Could be declared const. > + unsigned exec_type_size = 0; > + for (int i = 0; i < inst->sources; i++) { > + if (inst->src[i].file == BAD_FILE) > + break; It wouldn't be right to break early if the instruction has any valid sources after a non-present one. This should probably be: | if (inst->src[i].file != BAD_FILE) | exec_type_size = MAX2(exec_type_size, type_sz(inst->src[i].type)); instead. > + exec_type_size = MAX2(exec_type_size, type_sz(inst->src[i].type)); > + } > + assert(exec_type_size); > + > + if (channels_per_grf != REG_SIZE / exec_type_size) { I think you really need to use (exec_type_size == 8 ? 4 : 8) instead of the RHS of this expression. The hardware shifts exactly 8 channels per compressed half of the instruction regardless of the execution type, so this formula would give you an incorrect answer for exec_type_size < 4. > + max_width = MIN2(max_width, channels_per_grf); > + } Redundant braces. > + } > + > for (unsigned i = 0; i < inst->sources; i++) > reg_count = MAX2(reg_count, (unsigned)inst->regs_read(i)); > > -- > 2.7.4 > > _______________________________________________ > mesa-dev mailing list > mesa-dev@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
signature.asc
Description: PGP signature
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev