fs: add lowering step to duplicate sources with stride 0.

Samuel Iglesias Gonsálvez Wed, 11 Jan 2017 00:17:59 -0800

On Tue, 2017-01-10 at 22:07 -0800, Francisco Jerez wrote:
> Hi Matt,
> 
> Matt Turner <matts...@gmail.com> writes:
> 
> > On Sun, Jan 8, 2017 at 10:53 PM, Matt Turner <matts...@gmail.com>
> > wrote:
> > > On 01/05, Samuel Iglesias Gonsálvez wrote:
> > > > 
> > > > From: "Juan A. Suarez Romero" <jasua...@igalia.com>
> > > > 
> > > > When dealing with DF uniforms with just 1 component, we set
> > > > stride 0 to
> > > > use the value along the operation. However, when duplicating
> > > > the
> > > > regioning parameters in IVB/VLV, we are violating the regioning
> > > > restrictions.
> > > > 
> > > > So instead of using the value with stride 0, we just duplicate
> > > > it in a
> > > > register, and then use the register instead, avoiding a DF with
> > > > stride 0.
> > > > ---
> > > > src/mesa/drivers/dri/i965/brw_fs.cpp | 63
> > > > ++++++++++++++++++++++++++++++++++++
> > > > src/mesa/drivers/dri/i965/brw_fs.h   |  1 +
> > > > 2 files changed, 64 insertions(+)
> > > > 
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > > b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > > index eb3b4aa..78f2124 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > > > @@ -2168,6 +2168,62 @@ fs_visitor::lower_constant_loads()
> > > >    invalidate_live_intervals();
> > > > }
> > > > 
> > > > +/**
> > > > + * When dealing with double-precision floats (DF) in IVB/VLV,
> > > > we need to
> > > > + * duplicate the regioning parameters. This means that for a
> > > > DF scalar
> > > > + * (regioning <0,1,0>) we will end up using regioning <0,2,0>.
> > > > But
> > > > according
> > > > + * to General Restrictions on Regioning Parameters (Ivy PRM,
> > > > Vol. 4 Part
> > > > 3,
> > > > + * page 69), if VertStride = HorzStride = 0, Width must be 1
> > > > regardless
> > > > of the
> > > > + * value of ExecSize. So we would be violating the
> > > > restriction. To
> > > > overcome
> > > > + * it, this lowering step duplicates the scalar in a couple of
> > > > registers,
> > > > + * reading it as two floats to avoid the restriction.
> > > 
> > > 
> > > Huh, I would have thought that a <0,1,0>DF region would have done
> > > what
> > > we wanted, without the need to double any of the region
> > > parameters.
> > > 
> > > I haven't tested yet, so I'll play with it tomorrow and see if it
> > > blows
> > > up.
> > 
> > Indeed, it looks like gX<0,2,1>DF works for scalar sources.
> > 
> > The following patch seems to work for me in place of 04/22.
> > 
> > I would expect if we had an instruction like
> > 
> > add(16)  gX<1>DF  gY<8,8,1>DF  gZ<0,2,1>DF
> > 
> > that gX would write 2 registers and gY would read two registers,
> > but
> > gZ would read only one register -- and that would violate the
> > restriction that an instruction that writes two registers must also
> > read two registers.
> > 
> > But it seems that we never generate a DF instruction with
> > exec_size=16
> > because of
> > 
> > 4554       /* Note that in IVB/VLV for instructions that handles
> > DF,
> > we will duplicate
> > 4555        * the exec_size. So take this value for calculus
> > purposes.
> > 4556        */
> > 4557       unsigned exec_size = inst->exec_size;
> > 4558       if (devinfo->gen == 7 && !devinfo->is_haswell &&
> > inst->exec_data_size() == 8)
> > 4559          exec_size *= 2;
> > 
> 
> That's not the reason they aren't generated.  The hunk above had the
> opposite of the effect they intended, and didn't double the execution
> size value used by any regioning restriction calculation, it only
> raised
> the upper bound that the resulting lowered execution size is clamped
> to.
> That's about half of the reason I NAK'ed the patch.
> 
> The reason they aren't generating any 16-wide double float
> instructions
> is that they would violate other regioning restrictions implemented
> in
> get_fpu_lowered_simd_width(), in particular the one that forces
> non-force_writemask_all double float instructions to SIMD4 due to the
> Gen7 exec mask decompression bug.
> 
> > ... which I think Curro NAK'd elsewhere.
> > 
> > I think what we want is to use 0,2,1 regions for scalar sources,
> > and
> > split exec_size=16 (i.e., 8 DFs) instructions when they use scalar
> > sources in order to avoid violating the restriction. Otherwise, we
> > should emit exec_size=16 DF instructions where possible.
> 
> Agreed -- And thanks for testing it out on real hardware.  We'll
> definitely need a new restriction check in
> get_fpu_lowered_simd_width()
> that clamps to 4 the execution size of any instruction with a DF
> scalar
> source on pre-HSW hardware for the patch below to be fully correct.
> That way the SIMD lowering pass will get rid of any compressed
> instructions with scalar source (whether force_writemask_all or not)
> and
> we avoid tickling the Gen7 region decompression bug.
>


OK, I'll apply Matt's patch and this change too.

Thanks Matt and Curro for the patch/suggestions!

Sam

> > From 67bab74148e110171b8f3a999dc1d79616013108 Mon Sep 17 00:00:00
> > 2001
> > From: Matt Turner <matts...@gmail.com>
> > Date: Tue, 10 Jan 2017 19:33:22 -0800
> > Subject: [PATCH] i965: Use <0,2,1> region for scalar DF sources on
> > IVB/BYT.
> > 
> > On HSW+, scalar DF sources can be accessed using the normal <0,1,0>
> > region, but on IVB and BYT DF regions must be programmed in terms
> > of
> > floats. A <0,2,1> region accomplishes this.
> > ---
> >  src/mesa/drivers/dri/i965/brw_eu_emit.c | 26 ++++++++++++++++++++-
> > -----
> >  1 file changed, 20 insertions(+), 6 deletions(-)
> > 
> > diff --git a/src/mesa/drivers/dri/i965/brw_eu_emit.c
> > b/src/mesa/drivers/dri/i965/brw_eu_emit.c
> > index 2843385..5f81b7a 100644
> > --- a/src/mesa/drivers/dri/i965/brw_eu_emit.c
> > +++ b/src/mesa/drivers/dri/i965/brw_eu_emit.c
> > @@ -495,9 +495,16 @@ brw_set_src0(struct brw_codegen *p, brw_inst
> > *inst, struct brw_reg reg)
> >              brw_inst_set_src0_width(devinfo, inst, BRW_WIDTH_1);
> >              brw_inst_set_src0_vstride(devinfo, inst,
> > BRW_VERTICAL_STRIDE_0);
> >      } else {
> > -            brw_inst_set_src0_hstride(devinfo, inst, reg.hstride);
> > -            brw_inst_set_src0_width(devinfo, inst, reg.width);
> > -            brw_inst_set_src0_vstride(devinfo, inst, reg.vstride);
> > +            if (devinfo->gen == 7 && !devinfo->is_haswell &&
> > +                reg.type == BRW_REGISTER_TYPE_DF &&
> > has_scalar_region(reg)) {
> > +               brw_inst_set_src0_vstride(devinfo, inst,
> > BRW_VERTICAL_STRIDE_0);
> > +               brw_inst_set_src0_width(devinfo, inst,
> > BRW_WIDTH_2);
> > +               brw_inst_set_src0_hstride(devinfo, inst,
> > BRW_HORIZONTAL_STRIDE_1);
> > +            } else {
> > +               brw_inst_set_src0_vstride(devinfo, inst,
> > reg.vstride);
> > +               brw_inst_set_src0_width(devinfo, inst, reg.width);
> > +               brw_inst_set_src0_hstride(devinfo, inst,
> > reg.hstride);
> > +            }
> >      }
> >        } else {
> >           brw_inst_set_src0_da16_swiz_x(devinfo, inst,
> > @@ -577,9 +584,16 @@ brw_set_src1(struct brw_codegen *p, brw_inst
> > *inst, struct brw_reg reg)
> >              brw_inst_set_src1_width(devinfo, inst, BRW_WIDTH_1);
> >              brw_inst_set_src1_vstride(devinfo, inst,
> > BRW_VERTICAL_STRIDE_0);
> >      } else {
> > -            brw_inst_set_src1_hstride(devinfo, inst, reg.hstride);
> > -            brw_inst_set_src1_width(devinfo, inst, reg.width);
> > -            brw_inst_set_src1_vstride(devinfo, inst, reg.vstride);
> > +            if (devinfo->gen == 7 && !devinfo->is_haswell &&
> > +                reg.type == BRW_REGISTER_TYPE_DF &&
> > has_scalar_region(reg)) {
> > +               brw_inst_set_src1_vstride(devinfo, inst,
> > BRW_VERTICAL_STRIDE_0);
> > +               brw_inst_set_src1_width(devinfo, inst,
> > BRW_WIDTH_2);
> > +               brw_inst_set_src1_hstride(devinfo, inst,
> > BRW_HORIZONTAL_STRIDE_1);
> > +            } else {
> > +               brw_inst_set_src1_vstride(devinfo, inst,
> > reg.vstride);
> > +               brw_inst_set_src1_width(devinfo, inst, reg.width);
> > +               brw_inst_set_src1_hstride(devinfo, inst,
> > reg.hstride);
> > +            }
> >      }
> >        } else {
> >           brw_inst_set_src1_da16_swiz_x(devinfo, inst,
> > -- 
> > 2.7.3

signature.asc
Description: This is a digitally signed message part

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH 04/22] i965/fs: add lowering step to duplicate sources with stride 0.

Reply via email to