gen7: split instructions that run into exec masking bugs

Francisco Jerez Tue, 12 Jul 2016 12:02:48 -0700

Iago Toral Quiroga <ito...@igalia.com> writes:

> In fp64 we can produce code like this:
>
> mov(16) vgrf2<2>:UD, vgrf3<2>:UD
>
> That our simd lowering pass would typically split in instructions with a
> width of 8, writing to two consecutive registers each. Unfortunately, gen7
> hardware has a bug affecting execution masking and as a result, the
> second GRF register write won't work properly. Curro verified this:
>
> "The problem is that pre-Gen8 EUs are hardwired to use the QtrCtrl+1
>  (where QtrCtrl is the 8-bit quarter of the execution mask signals
>  specified in the instruction control fields) for the second
>  compressed half of any single-precision instruction (for
>  double-precision instructions it's hardwired to use NibCtrl+1,
>  at least on HSW), which means that the EU will apply the wrong
>  execution controls for the second sequential GRF write if the number
>  of channels per GRF is not exactly eight in single-precision mode (or
>  four in double-float mode)."
>
> In practice, this means that we cannot write more than one
> consecutive GRF in a single instruction if the number of channels
> per GRF is not exactly eight in single-precision mode (or four
> in double-float mode).
>
> This patch makes our SIMD lowering pass split this kind of instructions
> so that the split versions only write to a single register. In the
> example above this means that we split the write in 4 instructions, each
> one writing 4 UD elements (width = 4) to a single register.
>
> v2 (Curro):
>  - Make explicit that the thing about hardwiring NibCtrl+1 for the second
>    compressed half is known to happen in Haswell and the issue with IVB
>    might not be exactly the same.
>  - Assign max_width instead of returning early so that we can handle
>    multiple restrictions affecting to the same instruction.
>  - Avoid division by 0 if the instruction does not write any registers.
>  - Ignore instructions what have WE_all set.
>  - Use the instruction execution type size instead of the dst type size.
>
> v3 (Curro):
>  - Move the implementation down so it is not placed in the middle of another
>    workaround.
>  - Declare channels_per_grf as const.
>  - Don't break the loop early if we find a BAD_FILE source.
>  - Fix the number of channels that the hardware shifts for the second half
>    of a compressed instruction to be 8 in single precision and 4 in double
>    precision.
> ---
>  src/mesa/drivers/dri/i965/brw_fs.cpp | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
> b/src/mesa/drivers/dri/i965/brw_fs.cpp
> index 2f473cc..6ed98f5 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> @@ -4755,6 +4755,35 @@ get_fpu_lowered_simd_width(const struct 
> brw_device_info *devinfo,
>     if (inst->is_3src(devinfo) && !devinfo->supports_simd16_3src)
>        max_width = MIN2(max_width, inst->exec_size / reg_count);
>  
> +   /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where QtrCtrl is
> +    * the 8-bit quarter of the execution mask signals specified in the
> +    * instruction control fields) for the second compressed half of any
> +    * single-precision instruction (for double-precision instructions
> +    * it's hardwired to use NibCtrl+1, at least on HSW), which means that
> +    * the EU will apply the wrong execution controls for the second
> +    * sequential GRF write if the number of channels per GRF is not exactly
> +    * eight in single-precision mode (or four in double-float mode).
> +    *
> +    * In this situation we calculate the maximum size of the split
> +    * instructions so they only ever write to a single register.
> +    */
> +   if (devinfo->gen < 8 && inst->regs_written > 1 &&
> +       !inst->force_writemask_all) {
> +      const unsigned channels_per_grf = inst->exec_size / inst->regs_written;
> +      unsigned exec_type_size = 0;
> +      for (int i = 0; i < inst->sources; i++) {
> +         if (inst->src[i].file != BAD_FILE)
> +            exec_type_size = MAX2(exec_type_size, 
> type_sz(inst->src[i].type));
> +      }
> +      assert(exec_type_size);
> +
> +      /* The hardware shifts exactly 8 channels per compressed half of the
> +       * instruction in single-precision mode and exactly 4 in 
> double-precision.
> +       */
> +      if (channels_per_grf != (exec_type_size == 8 ? 4 : 8))
> +         max_width = MIN2(max_width, channels_per_grf);
> +   }
> +


Thanks,

Reviewed-by: Francisco Jerez <curroje...@riseup.net>

>     /* Only power-of-two execution sizes are representable in the instruction
>      * control fields.
>      */
> -- 
> 2.7.4

signature.asc
Description: PGP signature

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH v3] i965/fs/gen7: split instructions that run into exec masking bugs

Reply via email to