Francisco Jerez <curroje...@riseup.net> writes: > Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes: > >> From: Iago Toral Quiroga <ito...@igalia.com> >> >> In fp64 we can produce code like this: >> >> mov(16) vgrf2<2>:UD, vgrf3<2>:UD >> >> That our simd lowering pass would typically split in instructions with a >> width of 8, writing to two consecutive registers each. Unfortunately, gen7 >> hardware has a bug affecting execution masking and as a result, the >> second GRF register write won't work properly. Curro verified this: >> >> "The problem is that pre-Gen8 EUs are hardwired to use the QtrCtrl+1 >> (where QtrCtrl is the 8-bit quarter of the execution mask signals >> specified in the instruction control fields) for the second >> compressed half of any single-precision instruction (for >> double-precision instructions it's hardwired to use NibCtrl+1, >> at least on HSW), which means that the EU will apply the wrong >> execution controls for the second sequential GRF write if the number >> of channels per GRF is not exactly eight in single-precision mode (or >> four in double-float mode)." >> >> In practice, this means that we cannot write more than one >> consecutive GRF in a single instruction if the number of channels >> per GRF is not exactly eight in single-precision mode (or four >> in double-float mode). >> >> This patch makes our SIMD lowering pass split this kind of instructions >> so that the split versions only write to a single register. In the >> example above this means that we split the write in 4 instructions, each >> one writing 4 UD elements (width = 4) to a single register. >> >> v2 (Curro): >> - Make explicit that the thing about hardwiring NibCtrl+1 for the second >> compressed half is known to happen in Haswell and the issue with IVB >> might not be exactly the same. >> - Assign max_width instead of returning early so that we can handle >> multiple restrictions affecting to the same instruction. >> - Avoid division by 0 if the instruction does not write any registers. >> - Ignore instructions what have WE_all set. >> - Use the instruction execution type size instead of the dst type size. >> --- >> src/mesa/drivers/dri/i965/brw_fs.cpp | 28 ++++++++++++++++++++++++++++ >> 1 file changed, 28 insertions(+) >> >> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp >> b/src/mesa/drivers/dri/i965/brw_fs.cpp >> index 2f473cc..4d57412 100644 >> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp >> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp >> @@ -4691,6 +4691,34 @@ get_fpu_lowered_simd_width(const struct >> brw_device_info *devinfo, >> */ >> unsigned reg_count = inst->regs_written; >> > > You've put this right in the middle of one of my previous workarounds > ;), can you move it down a bit more next to the "According to the IVB > PRMs" line, or close to the end of the function? > >> + /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where QtrCtrl is >> + * the 8-bit quarter of the execution mask signals specified in the >> + * instruction control fields) for the second compressed half of any >> + * single-precision instruction (for double-precision instructions >> + * it's hardwired to use NibCtrl+1, at least on HSW), which means that >> + * the EU will apply the wrong execution controls for the second >> + * sequential GRF write if the number of channels per GRF is not exactly >> + * eight in single-precision mode (or four in double-float mode). >> + * >> + * In this situation we calculate the maximum size of the split >> + * instructions so they only ever write to a single register. >> + */ >> + if (devinfo->gen < 8 && inst->regs_written > 1 && >> + !inst->force_writemask_all) { >> + unsigned channels_per_grf = inst->exec_size / inst->regs_written; > > Could be declared const. > >> + unsigned exec_type_size = 0; >> + for (int i = 0; i < inst->sources; i++) { >> + if (inst->src[i].file == BAD_FILE) >> + break; > > It wouldn't be right to break early if the instruction has any valid > sources after a non-present one. This should probably be: > > | if (inst->src[i].file != BAD_FILE) > | exec_type_size = MAX2(exec_type_size, > type_sz(inst->src[i].type)); > > instead. > >> + exec_type_size = MAX2(exec_type_size, type_sz(inst->src[i].type)); >> + } >> + assert(exec_type_size); >> + >> + if (channels_per_grf != REG_SIZE / exec_type_size) { > > I think you really need to use (exec_type_size == 8 ? 4 : 8) instead of > the RHS of this expression. The hardware shifts exactly 8 channels per > compressed half of the instruction regardless of the execution type,
(for execution types other than DF that is) > so > this formula would give you an incorrect answer for exec_type_size < 4. > >> + max_width = MIN2(max_width, channels_per_grf); >> + } > > Redundant braces. > >> + } >> + >> for (unsigned i = 0; i < inst->sources; i++) >> reg_count = MAX2(reg_count, (unsigned)inst->regs_read(i)); >> >> -- >> 2.7.4 >> >> _______________________________________________ >> mesa-dev mailing list >> mesa-dev@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
signature.asc
Description: PGP signature
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev