> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Tuesday, June 22, 2021 1:08 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> Subject: Re: [PATCH]middle-end[RFC] slp: new implementation of complex
> numbers
> 
> On Mon, 21 Jun 2021, Tamar Christina wrote:
> 
> > Hi Richi,
> >
> > This patch is still very much incomplete and I do know that it is
> > missing things but it's complete enough such that examples are working
> > and allows me to show what I'm working towards.
> >
> > Note that this approach will remove a lot of code in
> > tree-vect-slp-patterns, but to keep the diff readable I've left it in
> > and just commented out or removed the calls where needed.
> >
> > The patch rewrites the complex numbers detection by splitting the
> > detection of structure from dataflow analysis.  In principle the
> > biggest difference between this and the previous implementation is
> > that instead of trying to detect valid complex operations it *makes*
> > an operation a valid complex operation.
> >
> > To do this each operation gets a dual optab which matches the same
> > structure but has no dataflow requirement.
> >
> > i.e. in this patch I added 4: ADDSUB, SUBADD, MUL_ADDSUB, MUL_SUBADD.
> >
> > There is then a mapping between these and their variant with the dataflow:
> >
> > * ADDSUB -> COMPLEX_ADD_ROT270
> > * SUBADD -> COMPLEX_ADD_ROT90
> > * MUL_ADDSUB -> COMPLEX_MUL_CONJ
> > * MUL_SUBADD -> COMPLEX_MUL
> >
> > with the intention that when we detect the structure of an operation
> > we query the backend for both optabs.
> >
> > This should result in one of four states:
> >
> >  * Not supported: Move on.
> >  * Supports ADDSUB only: Rewrite using ADDSUB, set type to 'cannot_transform'.
> >  * Supports COMPLEX_ADD_ROT270 only: Rewrite using ADDSUB, set type to 'must_transform'.
> >  * Supports both: Rewrite using ADDSUB, set type to 'can_transform'.
> >
> > The idea behind `can_transform` is to check the cost of the inverse
> > permute needed to use the complex operation, and if this is very
> > expensive then stick to ADDSUB.  This requires the target to be able to
> > cost the operations reasonably accurately.
> >
> > So for ADD this looks like
> >
> >  === vect_match_slp_patterns ===
> >  Analyzing SLP tree 0x494e970 for patterns
> >  Found ADDSUB pattern in SLP tree
> >  Target does not support ADDSUB for vector type vector(4) float
> >  Found COMPLEX_ADD_ROT270 pattern in SLP tree
> >  Target supports COMPLEX_ADD_ROT270 vectorization with mode vector(4) float
> >  Pattern matched SLP tree
> > node 0x494e970 (max_nunits=4, refcnt=1) op template: REALPART_EXPR <*_10> = _23;
> >   stmt 0 REALPART_EXPR <*_10> = _23;
> >   stmt 1 IMAGPART_EXPR <*_10> = _22;
> >   children 0x494ea00
> > node 0x494ea00 (max_nunits=4, refcnt=1) op template: slp_patt_39 =
> > .ADDSUB (_23, _23);
> >   stmt 0 _23 = _6 + _13;
> >   stmt 1 _22 = _12 - _8;
> >   children 0x494eb20 0x494ebb0
> > node 0x494eb20 (max_nunits=4, refcnt=1) op template: _13 =
> > REALPART_EXPR <*_3>;
> >   stmt 0 _13 = REALPART_EXPR <*_3>;
> >   stmt 1 _12 = IMAGPART_EXPR <*_3>;
> > node 0x494ebb0 (max_nunits=4, refcnt=1)
> > op: VEC_PERM_EXPR
> >   { }
> >   lane permutation { 0[1] 0[0] }
> >   children 0x494ec40
> > node 0x494ec40 (max_nunits=1, refcnt=2) op template: _8 =
> > REALPART_EXPR <*_5>;
> >   stmt 0 _8 = REALPART_EXPR <*_5>;
> >   stmt 1 _6 = IMAGPART_EXPR <*_5>;
> >   load permutation { 0 1 }
> >
> > and later during optimize_slp we get
> >
> > Transforming SLP expression from ADDSUB to COMPLEX_ADD_ROT270
> > processing node 0x494ebb0
> > simplifying permute node 0x494ebb0
> > Optimized SLP instances:
> > node 0x494e970 (max_nunits=4, refcnt=1) op template: REALPART_EXPR <*_10> = _23;
> >    stmt 0 REALPART_EXPR <*_10> = _23;
> >    stmt 1 IMAGPART_EXPR <*_10> = _22;
> >    children 0x494ea00
> > node 0x494ea00 (max_nunits=4, refcnt=1) op template: slp_patt_39 =
> > .COMPLEX_ADD_ROT270 (_23, _23);
> >    stmt 0 _23 = _6 + _13;
> >    stmt 1 _22 = _12 - _8;
> >    children 0x494eb20 0x494ebb0
> > node 0x494eb20 (max_nunits=4, refcnt=1) op template: _13 =
> > REALPART_EXPR <*_3>;
> >    stmt 0 _13 = REALPART_EXPR <*_3>;
> >    stmt 1 _12 = IMAGPART_EXPR <*_3>;
> > node 0x494ebb0 (max_nunits=4, refcnt=1)
> > op: VEC_PERM_EXPR
> >    { }
> >    lane permutation { 0[0] 0[1] }
> >    children 0x494ec40
> > node 0x494ec40 (max_nunits=1, refcnt=2) op template: _8 =
> > REALPART_EXPR <*_5>;
> >    stmt 0 _8 = REALPART_EXPR <*_5>;
> >    stmt 1 _6 = IMAGPART_EXPR <*_5>;
> >
> > Now I still have to elide the VEC_PERM_EXPR here but that's easy.
> 
> So having skimmed half of the patch - this means SLP pattern recog will
> initially recognize { +, -, +, - } as ADDSUB for example but not factor in 
> lane
> permutes from loads yet.  Now, suppose we have { +, -, -, + } seen in pattern
> recog - how's that handled?

These will be rejected; the lanes are still checked to ensure that it's a
{ +, -, +, - }.
The lane permutes are checked in unchanged code, namely

vect_detect_pair_op (slp_tree node, bool two_operands = true,...)

which only calls ::matches when the operation is consistent.
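
Roughly, for this example the consistency check amounts to something like the
following (just a sketch with made-up variable names, not the actual body of
vect_detect_pair_op):

  /* stmts[] are the scalar statements of the node and group_size its number
     of lanes.  For a { +, -, +, - } group the codes must alternate; anything
     else, e.g. { +, -, -, + }, fails here and ::matches is never called.  */
  for (unsigned i = 0; i < group_size; i++)
    {
      gimple *stmt = STMT_VINFO_STMT (stmts[i]);
      if (!is_gimple_assign (stmt)
          || gimple_assign_rhs_code (stmt) != (i & 1 ? MINUS_EXPR : PLUS_EXPR))
        return false;
    }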

> It might feed operations where we'd like to have inputs permuted and thus
> would end up with ADDSUB in the end?

In principle we could allow this; if we were to do so, one way to handle it
would be to add a permute for ADDSUB as well to permute the lanes.

I think it's better in this scenario to have the pattern recog code insert the
permute early on, in the ::build phase.  This will then maintain correctness in
the SLP tree and the extra permute will be dealt with in optimize_slp's normal
working.
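
Concretely that is more or less what the call I removed from
complex_add_pattern::build used to do, i.e. something along the lines of
(sketch):

  /* If ::matches saw the second operand with its lanes swapped, reintroduce
     the swap explicitly as a VEC_PERM node so the SLP tree stays correct and
     optimize_slp can later sink or elide it.  */
  SLP_TREE_CHILDREN (*this->m_node)[1]
    = vect_build_swap_evenodd_node (children[1]);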

It becomes a bit tricky here though, since such cases are perhaps better handled
by increasing the VF and using permuted loads (LOAD_LANES).

> 
> That said, you do
> 
> +         /* Check to see if this node can be transformed during permute
> +            materialization.  */
> +         bool patt_trans_cand_p = is_pattern_convert_candidate_p (node);
> +         if (patt_trans_cand_p)
> +           bitmap_set_bit (n_transform, idx);
> 
> in the propagation stage (but this looks like a static marking).

Yeah indeed, it's just a way to say "inspect these later".

> 
> And then just for each of those candidates you somehow "undo" the
> permutes on the SLP children (by modifying it in-place - uh, see below).
> But if you figure one child that is not permuted you've done that anyway but
> stop.

Yeah, if the child is unpermuted, vect_reverse_perm returns the identity permute,
which is then later elided.  In the final version this should just not set a
permute at all.
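
For example (hypothetical lanes): for perm = { 2, 0, 3, 1 } vect_reverse_perm
gives { 1, 3, 0, 2 }, i.e. the inverse permutation, while for an already
linear { 0, 1, 2, 3 } it gives back { 0, 1, 2, 3 }, which is the identity
case above.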

> 
> So what I've expected is that this "transform" is done during propagation
> (where you now set the n_transform bit), by means of detecting the
> candidate as materialization point (but I'd have to recap the kind of permutes
> we expect to be incoming).  That is, for a specific set of permutes on the
> successor edges claim the node can handle the permute and thus propagate
> unpermuted upwards.
> 
> At the materialization stage you'd then transform the node.

Yeah, that makes sense. I separated them out to have distinct "stages", but yeah
the loops can be combined. I'll do that.

> 
> Btw, how do we handle targets that can do complex_addsub but not addsub?
> 

That's why during pattern matching the optab code checks both.  ADDSUB is
temporarily accepted and then the transform code MUST transform it.  So ADDSUB
cannot leave optimize_slp if the operation is not supported.

This is what I meant with the four states (missing from this patch):

 * Not supported: Move on.
 * Supports ADDSUB only: Rewrite using ADDSUB, set type to 'cannot_transform'.
 * Supports COMPLEX_ADD_ROT270 only: Rewrite using ADDSUB, set type to 'must_transform'.
 * Supports both: Rewrite using ADDSUB, set type to 'can_transform'.
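
Roughly I'm thinking of something like the following when recording the
pattern (just a sketch; the 'kind' classification and where it lives are made
up and not in this patch):

  /* Decide how the temporarily built ADDSUB may be handled later.  */
  bool have_addsub
    = direct_internal_fn_supported_p (IFN_ADDSUB, vectype,
                                      OPTIMIZE_FOR_SPEED);
  bool have_cadd270
    = direct_internal_fn_supported_p (IFN_COMPLEX_ADD_ROT270, vectype,
                                      OPTIMIZE_FOR_SPEED);
  if (!have_addsub && !have_cadd270)
    return IFN_LAST;            /* Not supported: move on.  */
  else if (have_addsub && !have_cadd270)
    kind = cannot_transform;    /* Must stay ADDSUB.  */
  else if (!have_addsub && have_cadd270)
    kind = must_transform;      /* optimize_slp must rewrite it.  */
  else
    kind = can_transform;       /* Cost the inverse permute later.  */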

> > This works for ADD, but it doesn't work well when there's a
> > complicated sequence of loads.  So for example
> >
> > #define N 200
> > void g (double complex a[restrict N], double complex b[restrict N],
> >     double complex c[restrict N])
> > {
> >   for (int i=0; i < N; i++)
> >     {
> >       c[i] =  a[i] * b[i];
> >     }
> > }
> >
> > will result in an SLP tree where each operand of the multiply does
> > not get to see all the original vectors:
> >
> > Final SLP tree for instance 0x5678a30:
> > node 0x55965a0 (max_nunits=4, refcnt=2) op template: REALPART_EXPR
> > <*_7> = _25;
> >   stmt 0 REALPART_EXPR <*_7> = _25;
> >   stmt 1 IMAGPART_EXPR <*_7> = _26;
> >   children 0x5596630
> > node 0x5596630 (max_nunits=4, refcnt=2)
> > op: VEC_PERM_EXPR
> >   stmt 0 _25 = _17 - _22;
> >   stmt 1 _26 = _23 + _24;
> >   lane permutation { 0[0] 1[1] }
> >   children 0x5596a20 0x5596ab0
> > node 0x5596a20 (max_nunits=1, refcnt=1) op template: _25 = _17 - _22;
> >   { }
> >   children 0x55966c0 0x5596870
> > node 0x55966c0 (max_nunits=4, refcnt=3) op template: _17 = _10 * _19;
> >   stmt 0 _17 = _10 * _19;
> >   stmt 1 _23 = _10 * _18;
> >   children 0x5596750 0x55967e0
> > node 0x5596750 (max_nunits=4, refcnt=2) op template: _10 =
> > REALPART_EXPR <*_3>;
> >   stmt 0 _10 = REALPART_EXPR <*_3>;
> >   stmt 1 _10 = REALPART_EXPR <*_3>;
> >   load permutation { 0 0 }
> > node 0x55967e0 (max_nunits=4, refcnt=2) op template: _19 =
> > REALPART_EXPR <*_5>;
> >   stmt 0 _19 = REALPART_EXPR <*_5>;
> >   stmt 1 _18 = IMAGPART_EXPR <*_5>;
> >   load permutation { 0 1 }
> > node 0x5596870 (max_nunits=4, refcnt=3) op template: _22 = _9 * _18;
> >   stmt 0 _22 = _9 * _18;
> >   stmt 1 _24 = _9 * _19;
> >   children 0x5596900 0x5596990
> > node 0x5596900 (max_nunits=4, refcnt=2) op template: _9 =
> > IMAGPART_EXPR <*_3>;
> >   stmt 0 _9 = IMAGPART_EXPR <*_3>;
> >   stmt 1 _9 = IMAGPART_EXPR <*_3>;
> >   load permutation { 1 1 }
> > node 0x5596990 (max_nunits=4, refcnt=2) op template: _18 =
> > IMAGPART_EXPR <*_5>;
> >   stmt 0 _18 = IMAGPART_EXPR <*_5>;
> >   stmt 1 _19 = REALPART_EXPR <*_5>;
> >   load permutation { 1 0 }
> > node 0x5596ab0 (max_nunits=1, refcnt=1) op template: _26 = _23 + _24;
> >   { }
> >   children 0x55966c0 0x5596870
> >
> > So depending on the node each multiply either sees REAL 3 + REAL 5 +
> > IMAG 5 or IMAG 3 + IMAG 5 + REAL 5.
> >
> > Since we are removing the TWO_OPERANDS node we need to drop one of the
> > multiplies and so we need to give it the original 2 vectors as a
> > parameter.  The current implementation emits a permute operation to
> > combine the loads again and later pushes the permute down.  This is
> > problematic as it again requires us to do df analysis early.
> >
> > To counter this, in the patch I have changed loads to no longer come
> > out of build_slp with LOAD_PERMUTES but instead to have a VEC_PERM
> above each load.
> 
> Yep.  I've been there before (not sure if I ever sent you my work-in-progress
> here).  There are some things wrong in your patch but I guess doing this exactly for
> the permutes optimize_slp handles would be fine.

Yeah, it fell apart when testing the x264 code above, so I noticed I had some of
the lanes wrong, but didn't fix it yet as it wasn't needed for my testcases.

> 
> We should see to do this independently of the stuff above, I can handle this
> and will prepare a patch for this later.
> 

Thanks! That's a big help!

> > This is only temporary and the idea is that optimize SLP will push
> > them back down into the loads once it's deemed there are still permutes
> needed.
> >
> > So COMPLEX MUL becomes:
> >
> > Final SLP tree for instance 0x46d9c80:
> > node 0x45f76c0 (max_nunits=4, refcnt=2) op template: REALPART_EXPR
> > <*_7> = _25;
> >   stmt 0 REALPART_EXPR <*_7> = _25;
> >   stmt 1 IMAGPART_EXPR <*_7> = _26;
> >   children 0x45f7750
> > node 0x45f7750 (max_nunits=4, refcnt=2)
> > op: VEC_PERM_EXPR
> >   stmt 0 _25 = _17 - _22;
> >   stmt 1 _26 = _23 + _24;
> >   lane permutation { 0[0] 1[1] }
> >   children 0x45f7bd0 0x45f7c60
> > node 0x45f7bd0 (max_nunits=1, refcnt=1) op template: _25 = _17 - _22;
> >   { }
> >   children 0x45f77e0 0x45f7a20
> > node 0x45f77e0 (max_nunits=4, refcnt=3) op template: _17 = _10 * _19;
> >   stmt 0 _17 = _10 * _19;
> >   stmt 1 _23 = _10 * _18;
> >   children 0x45f7870 0x45f7990
> > node 0x45f7870 (max_nunits=4, refcnt=2)
> > op: VEC_PERM_EXPR
> >   { }
> >   lane permutation { 0[0] 0[0] }
> >   children 0x45f7900
> > node 0x45f7900 (max_nunits=1, refcnt=4) op template: _10 =
> > REALPART_EXPR <*_3>;
> >   stmt 0 _10 = REALPART_EXPR <*_3>;
> >   stmt 1 _9 = IMAGPART_EXPR <*_3>;
> >   load permutation { 0 1 }
> > node 0x45f7990 (max_nunits=4, refcnt=3) op template: _19 =
> > REALPART_EXPR <*_5>;
> >   stmt 0 _19 = REALPART_EXPR <*_5>;
> >   stmt 1 _18 = IMAGPART_EXPR <*_5>;
> >   load permutation { 0 1 }
> > node 0x45f7a20 (max_nunits=4, refcnt=3) op template: _22 = _9 * _18;
> >   stmt 0 _22 = _9 * _18;
> >   stmt 1 _24 = _9 * _19;
> >   children 0x45f7ab0 0x45f7b40
> > node 0x45f7ab0 (max_nunits=4, refcnt=2)
> > op: VEC_PERM_EXPR
> >   { }
> >   lane permutation { 0[1] 0[1] }
> >   children 0x45f7900
> > node 0x45f7b40 (max_nunits=4, refcnt=2)
> > op: VEC_PERM_EXPR
> >   { }
> >   lane permutation { 0[1] 0[0] }
> >   children 0x45f7990
> > node 0x45f7c60 (max_nunits=1, refcnt=1) op template: _26 = _23 + _24;
> >   { }
> >   children 0x45f77e0 0x45f7a20
> >
> > This representation has a number of advantages:
> >
> >  * complex mul becomes trivial, as trivial as add.
> >  * add now detects itself as part of a sequence of complex mul.  i.e. if you
> >    have only add, it will use it instead of failing to follow the df like 
> > now.
> >  * It opens the door a bit more to vectorize the first loop in x264.  The 
> > code
> >    there has a cross iteration mix of loading the full vector but in an 
> > order
> >    that the vectorizer thinks it can't use a single linear load for.  e.g.
> >
> >    .. = (a[0] + b[0]) - (a[4] + b[4])
> >    .. = (a[1] + b[1]) - (a[5] + b[5])
> >    .. = (a[2] + b[2]) - (a[6] + b[6])
> >    .. = (a[3] + b[3]) - (a[7] + b[7])
> >
> >  Ultimately it thinks it's cheaper to use scalar loads here, while (at
> > least for
> >  aarch64) if we see the unrolled loop together we can pattern match
> > this sequence  and rewrite it to something that uses the linear load.
> > This is because the  addition and subtract are widening.  So the
> > widening arithmetic operations we  have already only read half the vector
> since they're widening.
> >
> >  For this to work, we need to see all 16-byte loads, so the full
> > unrolled loop  of 4 iterations. Which currently doesn't happen for
> > NEON but does for SVE. (though  I wonder if now that you've added SLP
> > decomposition if the initial SLP build  shouldn't allow building of
> > SLP trees that are wider than the vector length and  later just
> > decompose it or bump up the VF.).  However this is something for  another
> discussion :).
> >
> >  Changing the loads like this makes things a lot simpler and we
> > require no  additional work anywhere else.  The complexity ends up in
> > optimize SLP which  now has the following changes:
> >
> >  * loads no longer introduce permutes, so they don't see the permutations.
> >    Instead each load can lead to a permute which can see the permutes.
> >  * As we're moving permutes up we stop at call boundaries as well.
> >  * As we're calculating materialization points for the permute we inspect
> the
> >    node and check if we can transform it, if so we mark it for 
> > transformation.
> >  * After calculating the materialization and transform points we iterate 
> > over
> >    the transform points checking whether we should transform.  If we
> should we
> >    modify the permute at the materialization point (which should be at the
> >    transform point) into the inverse permute.
> >  * During materialization as the permute is transformed this would undo
> the
> >    permutation if we transformed the instruction.
> >  * After computing this we elide the permute node, or push it down
> towards the
> >    load and then back into the load itself, splitting the node if required.
> >
> > After this the tree should be back to its normal form.  I still need
> > to work out when and where one needs to push down the permute all the
> > way to the loads.
> 
> It's indeed the case that the current permute "propagation" is away from
> loads only and a phase to propagate in the reverse direction is missing if we
> now have permute sources other than loads.  The cycle handling of the
> propagation is also a bit awkward and incomplete (but it "works").
> 

The thing I'm also wondering about is whether it's always better to push it back
in, e.g. if from a given load you have different permuted variants of the result,
whether it might sometimes be beneficial to keep the permutes and do the load only
once.

Thanks,
Tamar

> Richard.
> 
> > But this covers my thoughts.  Any feedback is appreciated to get it
> > right early on :)
> >
> > Regards,
> > Tamar
> >
> > --- inline copy of patch --
> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def index
> >
> b2f414d2131b867eda337cd30f5ed40ed7c9fa10..73d1d742414d16e06e8312967
> 990
> > d565eddb218d 100644
> > --- a/gcc/internal-fn.def
> > +++ b/gcc/internal-fn.def
> > @@ -277,6 +277,10 @@ DEF_INTERNAL_FLT_FN (SCALB, ECF_CONST, scalb,
> > binary)  DEF_INTERNAL_FLT_FLOATN_FN (FMIN, ECF_CONST, fmin, binary)
> > DEF_INTERNAL_FLT_FLOATN_FN (FMAX, ECF_CONST, fmax, binary)
> > DEF_INTERNAL_OPTAB_FN (XORSIGN, ECF_CONST, xorsign, binary)
> > +DEF_INTERNAL_OPTAB_FN (ADDSUB, ECF_CONST, addsub, binary)
> > +DEF_INTERNAL_OPTAB_FN (SUBADD, ECF_CONST, subadd, binary)
> > +DEF_INTERNAL_OPTAB_FN (MUL_ADDSUB, ECF_CONST, mul_addsub,
> binary)
> > +DEF_INTERNAL_OPTAB_FN (MUL_SUBADD, ECF_CONST, mul_subadd,
> binary)
> >  DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90, ECF_CONST, cadd90,
> binary)
> > DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST,
> cadd270, binary)
> > DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
> diff
> > --git a/gcc/optabs.def b/gcc/optabs.def index
> >
> b192a9d070b8aa72e5676b2eaa020b5bdd7ffcc8..73d23c49bdfa7423f67a62406
> 50c
> > 558cfaf3f2f0 100644
> > --- a/gcc/optabs.def
> > +++ b/gcc/optabs.def
> > @@ -290,6 +290,10 @@ OPTAB_D (atan_optab, "atan$a2")  OPTAB_D
> > (atanh_optab, "atanh$a2")  OPTAB_D (copysign_optab, "copysign$F$a3")
> > OPTAB_D (xorsign_optab, "xorsign$F$a3")
> > +OPTAB_D (addsub_optab, "addsub$a3")
> > +OPTAB_D (subadd_optab, "subadd$a3")
> > +OPTAB_D (mul_addsub_optab, "mul_addsub$a3") OPTAB_D
> > +(mul_subadd_optab, "mul_subadd$a3")
> >  OPTAB_D (cadd90_optab, "cadd90$a3")
> >  OPTAB_D (cadd270_optab, "cadd270$a3")  OPTAB_D (cmul_optab,
> > "cmul$a3") diff --git a/gcc/tree-vect-slp-patterns.c
> > b/gcc/tree-vect-slp-patterns.c index
> >
> 2ed49cd9edcabd7948b365dd60d7405b79079a7b..ce8d2daa1f06f17eda2cb082
> 55d7
> > 6bac9cf263f0 100644
> > --- a/gcc/tree-vect-slp-patterns.c
> > +++ b/gcc/tree-vect-slp-patterns.c
> > @@ -68,6 +68,39 @@ along with GCC; see the file COPYING3.  If not see
> >   * vect_pattern class
> >
> >
> **********************************************************
> ************
> > ********/
> >
> > +static bool
> > +vect_pattern_validate_optab (internal_fn ifn, slp_tree node);
> > +
> > +
> > +/* Check if the base or complex equivalents are supported.  */ static
> > +bool vect_pattern_validate_optab_or_eq (internal_fn ifn, slp_tree
> > +node) {
> > +  if (vect_pattern_validate_optab (ifn, node))
> > +    return true;
> > +
> > +  switch (ifn)
> > +  {
> > +    case IFN_ADDSUB:
> > +      ifn = IFN_COMPLEX_ADD_ROT270;
> > +      break;
> > +    case IFN_SUBADD:
> > +      ifn = IFN_COMPLEX_ADD_ROT90;
> > +      break;
> > +    case IFN_MUL_ADDSUB:
> > +      ifn = IFN_COMPLEX_MUL_CONJ;
> > +      break;
> > +    case IFN_MUL_SUBADD:
> > +      ifn = IFN_COMPLEX_MUL;
> > +      break;
> > +    default:
> > +      gcc_unreachable ();
> > +  }
> > +
> > +  return vect_pattern_validate_optab (ifn, node); }
> > +
> > +
> >  /* Default implementation of recognize that performs matching, validation
> and
> >     replacement of nodes but that can be overriden if required.  */
> >
> > @@ -630,8 +663,7 @@ complex_add_pattern::build (vec_info *vinfo)
> >
> >    /* First re-arrange the children.  */
> >    SLP_TREE_CHILDREN (*this->m_node)[0] = children[0];
> > -  SLP_TREE_CHILDREN (*this->m_node)[1] =
> > -    vect_build_swap_evenodd_node (children[1]);
> > +  SLP_TREE_CHILDREN (*this->m_node)[1] = children[1];
> >
> >    SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (*this->m_node)[0])++;
> >    SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (*this->m_node)[1])++;
> @@
> > -670,9 +702,9 @@ complex_add_pattern::matches (complex_operation_t
> op,
> >        Rotation 0 and 180 can be handled by normal SIMD code, so we don't
> need
> >        to care about them here.  */
> >    if (op == MINUS_PLUS)
> > -    ifn = IFN_COMPLEX_ADD_ROT90;
> > +    ifn = IFN_SUBADD;
> >    else if (op == PLUS_MINUS)
> > -    ifn = IFN_COMPLEX_ADD_ROT270;
> > +    ifn = IFN_ADDSUB;
> >    else
> >      return ifn;
> >
> > @@ -680,17 +712,7 @@ complex_add_pattern::matches
> (complex_operation_t op,
> >       we support.  */
> >    gcc_assert (ops->length () == 2);
> >
> > -  vec<slp_tree> children = SLP_TREE_CHILDREN ((*ops)[0]);
> > -
> > -  /* First node must be unpermuted.  */
> > -  if (linear_loads_p (perm_cache, children[0]) != PERM_EVENODD)
> > -    return IFN_LAST;
> > -
> > -  /* Second node must be permuted.  */
> > -  if (linear_loads_p (perm_cache, children[1]) != PERM_ODDEVEN)
> > -    return IFN_LAST;
> > -
> > -  if (!vect_pattern_validate_optab (ifn, *node))
> > +  if (!vect_pattern_validate_optab_or_eq (ifn, *node))
> >      return IFN_LAST;
> >
> >    return ifn;
> > @@ -1501,7 +1523,8 @@ vect_pattern_decl_t slp_patterns[]
> >       order patterns from the largest to the smallest.  Especially if they
> >       overlap in what they can detect.  */
> >
> > -  SLP_PATTERN (complex_operations_pattern),
> > +  SLP_PATTERN (complex_add_pattern),
> > +  //SLP_PATTERN (complex_operations_pattern),
> >  };
> >  #undef SLP_PATTERN
> >
> > diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c index
> >
> 0c1f85beeb2e9f3fb7c66c15d4d30594b2570f9e..f24ad85d15780807c49153895b
> 73
> > 40c9b4d4d1f3 100644
> > --- a/gcc/tree-vect-slp.c
> > +++ b/gcc/tree-vect-slp.c
> > @@ -1764,25 +1764,69 @@ vect_build_slp_tree_2 (vec_info *vinfo,
> slp_tree node,
> >     {
> >       *max_nunits = this_max_nunits;
> >       (*tree_size)++;
> > -     node = vect_create_new_slp_node (node, stmts, 0);
> > -     SLP_TREE_VECTYPE (node) = vectype;
> > +     node = vect_create_new_slp_node (node, stmts, 1);
> >       /* And compute the load permutation.  Whether it is actually
> >          a permutation depends on the unrolling factor which is
> > -        decided later.  */
> > -     vec<unsigned> load_permutation;
> > +        decided later.  We specifically materialize permutes at this
> > +        early stage and leave it up to optimize SLP to push them back
> > +        into the load if needed.  */
> > +     lane_permutation_t lane_permutation;
> > +
> > +     /* See the loads with a linear permute, this so that optimize_slp
> > +        can later push the permute upwards and downwards without
> needing
> > +        to inspect the parent node.  */
> > +     load_permutation_t load_permutation;
> >       int j;
> > -     stmt_vec_info load_info;
> > +     stmt_vec_info load_info, curr_load;
> > +     lane_permutation.create (group_size);
> >       load_permutation.create (group_size);
> > +     vec<stmt_vec_info> group_stmts;
> > +     bool linear = true;
> >       stmt_vec_info first_stmt_info
> >         = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS
> (node)[0]);
> > +     group_stmts.create (DR_GROUP_SIZE (first_stmt_info));
> > +     curr_load = first_stmt_info;
> >       FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j,
> load_info)
> >         {
> >           int load_place = vect_get_place_in_interleaving_chain
> >               (load_info, first_stmt_info);
> >           gcc_assert (load_place != -1);
> > -         load_permutation.safe_push (load_place);
> > +
> > +         lane_permutation.safe_push (std::make_pair (0, load_place));
> > +         load_permutation.quick_push (j);
> > +
> > +         group_stmts.quick_push (curr_load);
> > +         linear = linear && curr_load == load_info;
> > +         curr_load = DR_GROUP_NEXT_ELEMENT (curr_load);
> > +       }
> > +
> > +     if (!linear)
> > +       {
> > +         slp_tree child;
> > +         if (slp_tree *load = bst_map->get (group_stmts))
> > +           child = *load;
> > +         else
> > +           {
> > +             child =  vect_create_new_slp_node (group_stmts.copy (),
> 0);
> > +             SLP_TREE_VECTYPE (child) = vectype;
> > +             /* One for me, one for the BST map.  */
> > +             SLP_TREE_REF_COUNT (child)++;
> > +             bst_map->put (group_stmts, child);
> > +           }
> > +
> > +         /* Set some metadata on the load node.  */
> > +         SLP_TREE_REF_COUNT (child)++;
> > +         SLP_TREE_LANES (child) = SLP_TREE_LANES (node);
> > +         SLP_TREE_LOAD_PERMUTATION (child) = load_permutation;
> > +
> > +         /* But return a permute node.  */
> > +         SLP_TREE_LANE_PERMUTATION (node) = lane_permutation;
> > +         SLP_TREE_CODE (node) = VEC_PERM_EXPR;
> > +         SLP_TREE_CHILDREN (node).quick_push (child);
> > +         SLP_TREE_SCALAR_STMTS (node) = vNULL;
> >         }
> > -     SLP_TREE_LOAD_PERMUTATION (node) = load_permutation;
> > +
> > +     SLP_TREE_VECTYPE (node) = vectype;
> >       return node;
> >     }
> >      }
> > @@ -3430,8 +3474,8 @@ vect_analyze_slp (vec_info *vinfo, unsigned
> max_tree_size)
> >        FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (vinfo), i, instance)
> >     {
> >       slp_tree root = SLP_INSTANCE_TREE (instance);
> > -     optimize_load_redistribution (bst_map, vinfo, SLP_TREE_LANES
> (root),
> > -                                   &load_map, root);
> > +     //optimize_load_redistribution (bst_map, vinfo, SLP_TREE_LANES
> (root),
> > +   //                              &load_map, root);
> >     }
> >      }
> >
> > @@ -3548,6 +3592,83 @@ vect_slp_perms_eq (const vec<vec<unsigned> >
> &perms,
> >                      sizeof (unsigned) * perms[perm_a].length ()) ==
> 0));  }
> >
> > +static bool
> > +is_pattern_convert_candidate_p (slp_tree node) {
> > +  stmt_vec_info stmt_info;
> > +  if (!(stmt_info = SLP_TREE_REPRESENTATIVE (node))
> > +      || !STMT_VINFO_SLP_VECT_ONLY_PATTERN (stmt_info))
> > +    return false;
> > +
> > +  gimple* stmt = STMT_VINFO_STMT (stmt_info);  if (!is_gimple_call
> > + (stmt))
> > +    return false;
> > +
> > +  if (!gimple_call_internal_p (stmt))
> > +    return false;
> > +
> > +  internal_fn ifn = gimple_call_internal_fn (stmt);
> > +
> > +  switch (ifn)
> > +    {
> > +    case IFN_ADDSUB:
> > +    case IFN_SUBADD:
> > +    case IFN_MUL_ADDSUB:
> > +    case IFN_MUL_SUBADD:
> > +      return true;
> > +    default:
> > +      break;
> > +    }
> > +
> > +  return false;
> > +}
> > +
> > +static internal_fn
> > +vect_get_transformed_pattern (internal_fn ifn, slp_tree /* node */) {
> > +  switch (ifn)
> > +    {
> > +    case IFN_ADDSUB:
> > +      ifn = IFN_COMPLEX_ADD_ROT270;
> > +      break;
> > +    case IFN_SUBADD:
> > +      ifn = IFN_COMPLEX_ADD_ROT90;
> > +      break;
> > +    case IFN_MUL_ADDSUB:
> > +      ifn = IFN_COMPLEX_MUL_CONJ;
> > +      break;
> > +    case IFN_MUL_SUBADD:
> > +      ifn = IFN_COMPLEX_MUL;
> > +      break;
> > +    default:
> > +      gcc_unreachable ();
> > +    }
> > +
> > +  return ifn;
> > +}
> > +
> > +
> > +static load_permutation_t
> > +vect_reverse_perm (load_permutation_t perm) {
> > +  auto_vec<std::pair<unsigned, unsigned> > pairs;
> > +  pairs.create (perm.length ());
> > +  unsigned i;
> > +  for (i = 0; i < perm.length (); i++)
> > +    pairs.quick_push (std::make_pair (i, perm[i]));
> > +
> > +  typedef const std::pair<unsigned, unsigned>* cmp_t;  pairs.qsort
> > + ([](const void* a, const void* b) -> int
> > +    { return (int)((cmp_t)a)->second - (int)((cmp_t)b)->second; });
> > +
> > +  load_permutation_t rev_perm;
> > +  rev_perm.create (perm.length ());
> > +  for (i = 0; i < perm.length (); i++)
> > +    rev_perm.quick_push (pairs[i].first);
> > +
> > +  return rev_perm;
> > +}
> > +
> >  /* Optimize the SLP graph of VINFO.  */
> >
> >  void
> > @@ -3578,11 +3699,13 @@ vect_optimize_slp (vec_info *vinfo)
> >
> >    auto_sbitmap n_visited (vertices.length ());
> >    auto_sbitmap n_materialize (vertices.length ());
> > +  auto_sbitmap n_transform (vertices.length ());
> >    auto_vec<int> n_perm (vertices.length ());
> >    auto_vec<vec<unsigned> > perms;
> >
> >    bitmap_clear (n_visited);
> >    bitmap_clear (n_materialize);
> > +  bitmap_clear (n_transform);
> >    n_perm.quick_grow_cleared (vertices.length ());
> >    perms.safe_push (vNULL); /* zero is no permute */
> >
> > @@ -3602,55 +3725,73 @@ vect_optimize_slp (vec_info *vinfo)
> >      as entries to the reverse graph.  */
> >        if (!slpg->vertices[idx].succ)
> >     bitmap_set_bit (n_visited, idx);
> > -      /* Loads are the only thing generating permutes.  */
> > -      if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
> > -   continue;
> >
> > -      /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> > -    node unpermuted, record this permute.  */
> > -      stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
> > -      if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> > -   continue;
> > -      dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> > -      unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> > -      bool any_permute = false;
> > -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> > +      int perm_seed = 0;
> > +
> > +      for (graph_edge *pred = slpg->vertices[idx].pred;
> > +      pred; pred = pred->pred_next)
> >     {
> > -     unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
> > -     imin = MIN (imin, idx);
> > -     imax = MAX (imax, idx);
> > -     if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
> > -       any_permute = true;
> > -   }
> > -      /* If there's no permute no need to split one out.  */
> > -      if (!any_permute)
> > -   continue;
> > -      /* If the span doesn't match we'd disrupt VF computation, avoid
> > -    that for now.  */
> > -      if (imax - imin + 1 != SLP_TREE_LANES (node))
> > -   continue;
> > +     int src_idx = pred->src;
> > +     slp_tree src_node = vertices[src_idx];
> >
> > -      /* For now only handle true permutes, like
> > -    vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> > -    when permuting constants and invariants keeping the permute
> > -    bijective.  */
> > -      auto_sbitmap load_index (SLP_TREE_LANES (node));
> > -      bitmap_clear (load_index);
> > -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> > -   bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION
> (node)[j] - imin);
> > -      unsigned j;
> > -      for (j = 0; j < SLP_TREE_LANES (node); ++j)
> > -   if (!bitmap_bit_p (load_index, j))
> > -     break;
> > -      if (j != SLP_TREE_LANES (node))
> > -   continue;
> > +     /* Loads are the only thing generating permutes, however the
> permutes
> > +        are not yet lowered into the loads.  So instead chase up the chain
> > +        to find a permute node.  */
> > +     if (!SLP_TREE_LANE_PERMUTATION (src_node).exists ())
> > +       continue;
> > +
> > +     /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> > +        node unpermuted, record this permute.  */
> > +     stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
> > +     if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> > +       continue;
> > +     dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> > +     unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> > +     bool any_permute = false;
> > +     for (unsigned j = 0; j < SLP_TREE_LANES (src_node); ++j)
> > +       {
> > +         unsigned src_op = SLP_TREE_LANE_PERMUTATION
> (src_node)[j].first;
> > +         unsigned idx = SLP_TREE_LANE_PERMUTATION
> (src_node)[j].second;
> > +         /* This isn't a load permute moved out.  */
> > +         if (src_op != 0)
> > +           continue;
> > +         imin = MIN (imin, idx);
> > +         imax = MAX (imax, idx);
> > +         if (idx - SLP_TREE_LANE_PERMUTATION (src_node)[0].second != j)
> > +           any_permute = true;
> > +       }
> > +     /* If there's no permute no need to split one out.  */
> > +     if (!any_permute)
> > +       continue;
> > +     /* If the span doesn't match we'd disrupt VF computation, avoid
> > +        that for now.  */
> > +     if (imax - imin + 1 != SLP_TREE_LANES (node))
> > +       continue;
> > +
> > +     /* For now only handle true permutes, like
> > +        vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> > +        when permuting constants and invariants keeping the permute
> > +        bijective.  */
> > +     auto_sbitmap load_index (SLP_TREE_LANES (src_node));
> > +     bitmap_clear (load_index);
> > +     for (unsigned j = 0; j < SLP_TREE_LANES (src_node); ++j)
> > +       bitmap_set_bit (load_index, SLP_TREE_LANE_PERMUTATION
> (src_node)[j].second - imin);
> > +     unsigned j;
> > +     for (j = 0; j < SLP_TREE_LANES (src_node); ++j)
> > +     if (!bitmap_bit_p (load_index, j))
> > +       break;
> > +     if (j != SLP_TREE_LANES (src_node))
> > +       continue;
> >
> > -      vec<unsigned> perm = vNULL;
> > -      perm.safe_grow (SLP_TREE_LANES (node), true);
> > -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> > -   perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> > -      perms.safe_push (perm);
> > -      n_perm[idx] = perms.length () - 1;
> > +     vec<unsigned> perm = vNULL;
> > +     perm.safe_grow (SLP_TREE_LANES (src_node), true);
> > +     for (unsigned j = 0; j < SLP_TREE_LANES (src_node); ++j)
> > +       perm[j] = SLP_TREE_LANE_PERMUTATION (src_node)[j].second -
> imin;
> > +     perms.safe_push (perm);
> > +     n_perm[src_idx] = perms.length () - 1;
> > +     perm_seed = -1;
> > +       }
> > +      n_perm[idx] = perm_seed;
> >      }
> >
> >    /* Propagate permutes along the graph and compute materialization
> > points.  */ @@ -3667,12 +3808,12 @@ vect_optimize_slp (vec_info *vinfo)
> >       slp_tree node = vertices[idx];
> >       /* For leafs there's nothing to do - we've seeded permutes
> >          on those above.  */
> > -     if (SLP_TREE_DEF_TYPE (node) != vect_internal_def)
> > +     if (SLP_TREE_DEF_TYPE (node) != vect_internal_def ||
> > +!slpg->vertices[idx].succ)
> >         continue;
> >
> >       bitmap_set_bit (n_visited, idx);
> >
> > -     /* We cannot move a permute across a store.  */
> > +     /* We cannot move a permute across a store.  TODO: or a call.  */
> >       if (STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (node))
> >           && DR_IS_WRITE
> >                (STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE
> (node)))) @@
> > -3723,6 +3864,12 @@ vect_optimize_slp (vec_info *vinfo)
> >           changed = true;
> >         }
> >
> > +     /* Check to see if this node can be transformed during permute
> > +        materialization.  */
> > +     bool patt_trans_cand_p = is_pattern_convert_candidate_p (node);
> > +     if (patt_trans_cand_p)
> > +       bitmap_set_bit (n_transform, idx);
> > +
> >       if (perm == 0)
> >         continue;
> >
> > @@ -3764,6 +3911,46 @@ vect_optimize_slp (vec_info *vinfo)
> >      }
> >    while (changed || iteration == 1);
> >
> > +  bitmap_clear (n_visited);
> > +
> > +  /* Transform and record any permutes for the usage of the instruction.
> > +     TODO: Check if it's cheaper to keep ADDSUB if available or use to
> > +      COMPLEX_ADD based on the reverse permute that's needed.  */
> > +  sbitmap_iterator bi;
> > +  EXECUTE_IF_SET_IN_BITMAP (n_transform, 0, i, bi)
> > +    {
> > +      slp_tree node = vertices[i];
> > +
> > +      for (graph_edge *succ = slpg->vertices[i].succ;
> > +                  succ; succ = succ->succ_next)
> > +   {
> > +     int perm_id = n_perm[succ->dest];
> > +     if (perm_id <= 0)
> > +       continue;
> > +
> > +     load_permutation_t linperm = vect_reverse_perm
> (perms[perm_id]);
> > +     perms[perm_id].release ();
> > +     perms[perm_id] = linperm;
> > +
> > +          bitmap_set_bit (n_materialize, succ->dest);
> > +   }
> > +
> > +      /* Transform the node if we have determined that it's beneficial to 
> > do
> so.
> > +    when we do so the perm that has been calculated for the children of
> the
> > +    node will be applied.  */
> > +      stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
> > +      gimple* stmt = STMT_VINFO_STMT (stmt_info);
> > +      internal_fn old_ifn = gimple_call_internal_fn (stmt);
> > +      internal_fn ifn = vect_get_transformed_pattern (old_ifn ,node);
> > +
> > +      if (dump_enabled_p ())
> > +   dump_printf_loc (MSG_NOTE, vect_location,
> > +                    "Transforming SLP expression from %s to %s\n",
> > +                    internal_fn_name (old_ifn), internal_fn_name (ifn));
> > +      gimple_call_set_internal_fn (as_a <gcall *> (stmt), ifn);
> > +    }
> > +
> > +
> >    /* Materialize.  */
> >    for (i = 0; i < vertices.length (); ++i)
> >      {
> > @@ -3798,6 +3985,11 @@ vect_optimize_slp (vec_info *vinfo)
> >                         SLP_TREE_SCALAR_OPS (child), true);
> >     }
> >
> > +      if (dump_enabled_p ())
> > +   dump_printf_loc (MSG_NOTE, vect_location,
> > +                    "processing node %p\n",
> > +                    node);
> > +
> >        if (bitmap_bit_p (n_materialize, i))
> >     {
> >       if (SLP_TREE_LOAD_PERMUTATION (node).exists ()) @@ -3986,6
> > +4178,16 @@ vect_optimize_slp (vec_info *vinfo)
> >         }
> >     }
> >      }
> > +
> > +  if (dump_enabled_p ())
> > +    {
> > +      dump_printf_loc (MSG_NOTE, vect_location, "Optimized SLP instances:
> \n");
> > +      hash_set<slp_tree> visited;
> > +      slp_instance instance;
> > +      FOR_EACH_VEC_ELT (vinfo->slp_instances, i, instance)
> > +   vect_print_slp_graph (MSG_NOTE, vect_location,
> > +                         SLP_INSTANCE_TREE (instance), visited);
> > +    }
> >  }
> >
> >  /* Gather loads reachable from the individual SLP graph entries.  */
> >
> >
> >
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
