gen7: split instructions that run into exec masking bugs

Francisco Jerez Mon, 11 Jul 2016 12:26:06 -0700

Francisco Jerez <curroje...@riseup.net> writes:

> Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes:
>
>> From: Iago Toral Quiroga <ito...@igalia.com>
>>
>> In fp64 we can produce code like this:
>>
>> mov(16) vgrf2<2>:UD, vgrf3<2>:UD
>>
>> That our simd lowering pass would typically split in instructions with a
>> width of 8, writing to two consecutive registers each. Unfortunately, gen7
>> hardware has a bug affecting execution masking and as a result, the
>> second GRF register write won't work properly. Curro verified this:
>>
>> "The problem is that pre-Gen8 EUs are hardwired to use the QtrCtrl+1
>>  (where QtrCtrl is the 8-bit quarter of the execution mask signals
>>  specified in the instruction control fields) for the second
>>  compressed half of any single-precision instruction (for
>>  double-precision instructions it's hardwired to use NibCtrl+1,
>>  at least on HSW), which means that the EU will apply the wrong
>>  execution controls for the second sequential GRF write if the number
>>  of channels per GRF is not exactly eight in single-precision mode (or
>>  four in double-float mode)."
>>
>> In practice, this means that we cannot write more than one
>> consecutive GRF in a single instruction if the number of channels
>> per GRF is not exactly eight in single-precision mode (or four
>> in double-float mode).
>>
>> This patch makes our SIMD lowering pass split this kind of instructions
>> so that the split versions only write to a single register. In the
>> example above this means that we split the write in 4 instructions, each
>> one writing 4 UD elements (width = 4) to a single register.
>>
>> v2 (Curro):
>>  - Make explicit that the thing about hardwiring NibCtrl+1 for the second
>>    compressed half is known to happen in Haswell and the issue with IVB
>>    might not be exactly the same.
>>  - Assign max_width instead of returning early so that we can handle
>>    multiple restrictions affecting to the same instruction.
>>  - Avoid division by 0 if the instruction does not write any registers.
>>  - Ignore instructions what have WE_all set.
>>  - Use the instruction execution type size instead of the dst type size.
>> ---
>>  src/mesa/drivers/dri/i965/brw_fs.cpp | 28 ++++++++++++++++++++++++++++
>>  1 file changed, 28 insertions(+)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
>> b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> index 2f473cc..4d57412 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> @@ -4691,6 +4691,34 @@ get_fpu_lowered_simd_width(const struct 
>> brw_device_info *devinfo,
>>      */
>>     unsigned reg_count = inst->regs_written;
>>  
>
> You've put this right in the middle of one of my previous workarounds
> ;), can you move it down a bit more next to the "According to the IVB
> PRMs" line, or close to the end of the function?
>
>> +   /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where QtrCtrl is
>> +    * the 8-bit quarter of the execution mask signals specified in the
>> +    * instruction control fields) for the second compressed half of any
>> +    * single-precision instruction (for double-precision instructions
>> +    * it's hardwired to use NibCtrl+1, at least on HSW), which means that
>> +    * the EU will apply the wrong execution controls for the second
>> +    * sequential GRF write if the number of channels per GRF is not exactly
>> +    * eight in single-precision mode (or four in double-float mode).
>> +    *
>> +    * In this situation we calculate the maximum size of the split
>> +    * instructions so they only ever write to a single register.
>> +    */
>> +   if (devinfo->gen < 8 && inst->regs_written > 1 &&
>> +       !inst->force_writemask_all) {
>> +      unsigned channels_per_grf = inst->exec_size / inst->regs_written;
>
> Could be declared const.
>
>> +      unsigned exec_type_size = 0;
>> +      for (int i = 0; i < inst->sources; i++) {
>> +         if (inst->src[i].file == BAD_FILE)
>> +            break;
>
> It wouldn't be right to break early if the instruction has any valid
> sources after a non-present one.  This should probably be:
>
> |           if (inst->src[i].file != BAD_FILE)
> |              exec_type_size = MAX2(exec_type_size, 
> type_sz(inst->src[i].type));
>
> instead.
>
>> +         exec_type_size = MAX2(exec_type_size, type_sz(inst->src[i].type));
>> +      }
>> +      assert(exec_type_size);
>> +
>> +      if (channels_per_grf != REG_SIZE / exec_type_size) {
>
> I think you really need to use (exec_type_size == 8 ? 4 : 8) instead of
> the RHS of this expression.  The hardware shifts exactly 8 channels per
> compressed half of the instruction regardless of the execution type,


(for execution types other than DF that is)

> so
> this formula would give you an incorrect answer for exec_type_size < 4.
>
>> +         max_width = MIN2(max_width, channels_per_grf);
>> +      }
>
> Redundant braces.
>
>> +   }
>> +
>>     for (unsigned i = 0; i < inst->sources; i++)
>>        reg_count = MAX2(reg_count, (unsigned)inst->regs_read(i));
>>  
>> -- 
>> 2.7.4
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev

signature.asc
Description: PGP signature

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH v2 3/6] i965/fs/gen7: split instructions that run into exec masking bugs

Reply via email to