On Wed, 5 Jun 2024, Richard Biener wrote:

> On Tue, 4 Jun 2024, Richard Sandiford wrote:
> 
> > Richard Biener <rguent...@suse.de> writes:
> > > The following emulates classical interleaving for SLP load permutes
> > > that we are unlikely to handle natively.  This is to handle cases
> > > where interleaving (or load/store-lanes) is the optimal choice for
> > > vectorizing even when we are doing that within SLP.  An example
> > > would be
> > >
> > > void foo (int * __restrict a, int * b)
> > > {
> > >   for (int i = 0; i < 16; ++i)
> > >     {
> > >       a[4*i + 0] = b[4*i + 0] * 3;
> > >       a[4*i + 1] = b[4*i + 1] + 3;
> > >       a[4*i + 2] = (b[4*i + 2] * 3 + 3);
> > >       a[4*i + 3] = b[4*i + 3] * 3;
> > >     }
> > > }
> > >
> > > where currently the SLP store is merging four single-lane SLP
> > > sub-graphs but none of the loads in it can be code-generated
> > > with V4SImode vectors and a VF of four because the permutes would
> > > need three vectors.
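> > >
> > > To illustrate (my reconstruction, not dump output): with V4SImode
> > > and a VF of four, one iteration of the vector loop reads
> > > b[0] .. b[15], and the four single-lane SLP loads would carry load
> > > permutations like
> > >
> > >   lane 0:  { 0, 4,  8, 12 }
> > >   lane 1:  { 1, 5,  9, 13 }
> > >   lane 2:  { 2, 6, 10, 14 }
> > >   lane 3:  { 3, 7, 11, 15 }
> > >
> > > each selecting elements from more than two of the loaded vectors,
> > > which a two-input permute cannot express directly.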
> > 
> > Nice!
> > 
> > > The patch introduces a lowering phase after SLP discovery but
> > > before SLP pattern recognition or permute optimization that
> > > analyzes all loads from the same dataref group and creates an
> > > interleaving scheme starting from an unpermuted load.
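> > >
> > > For the group-of-four example above the lowering would look roughly
> > > like this (a sketch with invented names, not actual dump output):
> > >
> > >   /* b0 .. b3 are the unpermuted contiguous V4SImode loads of
> > >      b[0] .. b[15].  Level 1: split into even and odd lanes.  */
> > >   even01 = VEC_PERM_EXPR <b0, b1, { 0, 2, 4, 6 }>;
> > >   even23 = VEC_PERM_EXPR <b2, b3, { 0, 2, 4, 6 }>;
> > >   odd01  = VEC_PERM_EXPR <b0, b1, { 1, 3, 5, 7 }>;
> > >   odd23  = VEC_PERM_EXPR <b2, b3, { 1, 3, 5, 7 }>;
> > >   /* Level 2: even/odd again yields the four single lanes.  */
> > >   lane0  = VEC_PERM_EXPR <even01, even23, { 0, 2, 4, 6 }>;  /* b[4i+0] */
> > >   lane2  = VEC_PERM_EXPR <even01, even23, { 1, 3, 5, 7 }>;  /* b[4i+2] */
> > >   lane1  = VEC_PERM_EXPR <odd01, odd23, { 0, 2, 4, 6 }>;    /* b[4i+1] */
> > >   lane3  = VEC_PERM_EXPR <odd01, odd23, { 1, 3, 5, 7 }>;    /* b[4i+3] */
> > >
> > > so every permute has just two vector inputs.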
> > >
> > > What can be handled is quite restrictive, matching only a subset
> > > of the non-SLP interleaving cases (the power-of-two group size
> > > ones, and in addition only cases without gaps).  The non-SLP
> > > interleaving vectorization can additionally handle group sizes of 3
> > > and 5 - but I am not sure if it's possible to do that in a
> > > VL-agnostic way.  It should still be possible to set up the SLP
> > > graph in a way that a load-lanes could be matched by SLP pattern
> > > recognition.
> > 
> > Yeah, I don't think it would be possible to decompose a 3- or
> > 5-lane grouped load into a series of VLA 2-input permutes.
> > But (as I think you're saying) it seems like a load-3-lanes would just
> > be a load with a LANE_PERMUTATION of N, N+3, N+6, N+9, ... for lane N.
> > Is that right?
> 
> Yes, that's how it looks without this patch.  I think we'd need
> a load node loading N, N+1, N+2, ... and then permute nodes
> with N, N+3, ... and N+1, N+4, ... and N+2, N+5 ... so we generate
> one .LOAD_LANES from the load node and the permutes pick up the
> correct vector defs?  I'm not sure yet how classification and
> code generation would work for this.
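>
> As a sketch (invented notation, not an actual dump), for a group of
> three that SLP graph would be
>
>   load node L:  b[0], b[1], b[2], b[3], b[4], b[5], ...  (unpermuted)
>   permute:      lanes { 0, 3, 6, ... } of L  -> first group member
>   permute:      lanes { 1, 4, 7, ... } of L  -> second group member
>   permute:      lanes { 2, 5, 8, ... } of L  -> third group member
>
> where L would be the node code-generated as .LOAD_LANES and the
> permutes would just pick up its vector defs.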
> 
> The store side is already on trunk with the single SLP store node
> getting lanes via permutes.
> 
> It might be that we want a load/store node with N inputs/outputs as the
> best representation and use lane_permutation to indicate the
> input (for stores) and output (for loads) "permute".
> 
> > > As said, gaps are currently not handled - for SLP we have a
> > > representational issue in that SLP_TREE_SCALAR_STMTS for "gap lanes"
> > > would need to be filled in some way (even if we just push NULL).
> > >
> > > The patch misses multi-level even/odd handling as well as CSEing
> > > intermediate generated permutes.  Both are quite straightforward
> > > to add, but perhaps there's a better or more general strategy
> > > for lowering?  The main goal of the patch is to avoid falling
> > > back to non-SLP for cases the interleaving code handles.
> > 
> > Does the multi-level thing include examples like:
> > 
> > int a[2 * 16];
> > int b[8 * 16];
> > void f()
> > {
> >   for (int i = 0; i < 16; ++i)
> >     {
> >       a[i * 2 + 0] += b[i * 8 + 0] + b[i * 8 + 1] + b[i * 8 + 2] + b[i * 8 + 3];
> >       a[i * 2 + 1] += b[i * 8 + 4] + b[i * 8 + 5] + b[i * 8 + 6] + b[i * 8 + 7];
> >     }
> > }
> > 
> > ?  For that we generate:
> > 
> >   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
> >   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
> >   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
> >   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
> >   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
> >   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
> >   _53 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 1, 4, 5 }>;
> >   _52 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 1, 4, 5 }>;
> >   _51 = VEC_PERM_EXPR <_53, _52, { 1, 3, 5, 7 }>;
> >   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;
> > 
> > (two even level 1, one even level 2, one odd level 1), whereas
> > preferring 2xeven + 2xodd would avoid the third set of first-level
> > permutes:
> > 
> >   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
> >   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
> >   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
> >   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
> >   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
> >   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
> >   _51 = VEC_PERM_EXPR <_45, _44, { 0, 2, 4, 6 }>;
> >   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;

Btw, I just checked GCC 14, which uses interleaving for this
(SLP would need a three-vector permute), and that gets us double the
VF and overall 28 permutes, compared to 10 permutes for the above
(and 8 for your optimal variant).  So the patch should already be an
improvement at least.

Richard.
