> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Tuesday, June 22, 2021 1:08 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>
> Subject: Re: [PATCH]middle-end[RFC] slp: new implementation of complex numbers
>
> On Mon, 21 Jun 2021, Tamar Christina wrote:
>
> > Hi Richi,
> >
> > This patch is still very much incomplete and I do know that it is
> > missing things, but it's complete enough that examples work and it
> > lets me show what I'm working towards.
> >
> > Note that this approach will remove a lot of code in
> > tree-vect-slp-patterns, but to keep the diff readable I've left that
> > code in and just commented out or removed the calls where needed.
> >
> > The patch rewrites the complex numbers detection by splitting the
> > detection of structure from the dataflow analysis.  In principle the
> > biggest difference between this and the previous implementation is
> > that instead of trying to detect valid complex operations it *makes*
> > an operation a valid complex operation.
> >
> > To do this each operation gets a dual optab which matches the same
> > structure but has no dataflow requirement.
> >
> > i.e. in this patch I added 4: ADDSUB, SUBADD, MUL_ADDSUB, MUL_SUBADD.
> >
> > There is then a mapping between these and their variants with the
> > dataflow:
> >
> > * ADDSUB -> COMPLEX_ADD_ROT270
> > * SUBADD -> COMPLEX_ADD_ROT90
> > * MUL_ADDSUB -> COMPLEX_MUL_CONJ
> > * MUL_SUBADD -> COMPLEX_MUL
> >
> > with the intention that when we detect the structure of an operation
> > we query the backend for both optabs.
> >
> > This should result in one of four states:
> >
> > * not supported: Move on.
> > * Supports ADDSUB only: Rewrite using ADDSUB, set type to 'cannot_transform'
> > * Supports COMPLEX_ADD_ROT270 only: Rewrite using ADDSUB, set type to 'must_transform'
> > * Supports both: Rewrite using ADDSUB, set type to 'can_transform'
> >
> > The idea behind `can_transform` is to check the cost of the inverse
> > permute needed to use the complex operation; if that permute is very
> > expensive then we stick with ADDSUB.  This requires the target to be
> > able to cost the operations reasonably accurately.
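
To make the four states concrete, here is a minimal standalone sketch of
the intended classification.  This is illustrative C++ only, not the
patch's code; `transform_kind` and `classify` are made-up names, and in
the patch the two target queries go through vect_pattern_validate_optab
instead of plain booleans:

  /* Illustrative sketch: classify a matched structure by which of the
     two optabs the target supports.  */
  enum transform_kind
  {
    not_supported,    /* Neither form is available: move on.  */
    cannot_transform, /* Only ADDSUB: the structural form must stay.  */
    must_transform,   /* Only COMPLEX_ADD_ROT270: optimize_slp must convert.  */
    can_transform     /* Both: let permute costing decide later.  */
  };

  static transform_kind
  classify (bool addsub_ok, bool complex_add_ok)
  {
    if (addsub_ok && complex_add_ok)
      return can_transform;
    if (addsub_ok)
      return cannot_transform;
    if (complex_add_ok)
      return must_transform;
    return not_supported;
  }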
> >
> > So for ADD this looks like:
> >
> > === vect_match_slp_patterns ===
> > Analyzing SLP tree 0x494e970 for patterns
> > Found ADDSUB pattern in SLP tree
> > Target does not support ADDSUB for vector type vector(4) float
> > Found COMPLEX_ADD_ROT270 pattern in SLP tree
> > Target supports COMPLEX_ADD_ROT270 vectorization with mode vector(4) float
> > Pattern matched SLP tree
> > node 0x494e970 (max_nunits=4, refcnt=1)
> > op template: REALPART_EXPR <*_10> = _23;
> >   stmt 0 REALPART_EXPR <*_10> = _23;
> >   stmt 1 IMAGPART_EXPR <*_10> = _22;
> >   children 0x494ea00
> > node 0x494ea00 (max_nunits=4, refcnt=1)
> > op template: slp_patt_39 = .ADDSUB (_23, _23);
> >   stmt 0 _23 = _6 + _13;
> >   stmt 1 _22 = _12 - _8;
> >   children 0x494eb20 0x494ebb0
> > node 0x494eb20 (max_nunits=4, refcnt=1)
> > op template: _13 = REALPART_EXPR <*_3>;
> >   stmt 0 _13 = REALPART_EXPR <*_3>;
> >   stmt 1 _12 = IMAGPART_EXPR <*_3>;
> > node 0x494ebb0 (max_nunits=4, refcnt=1)
> > op: VEC_PERM_EXPR
> >   { }
> >   lane permutation { 0[1] 0[0] }
> >   children 0x494ec40
> > node 0x494ec40 (max_nunits=1, refcnt=2)
> > op template: _8 = REALPART_EXPR <*_5>;
> >   stmt 0 _8 = REALPART_EXPR <*_5>;
> >   stmt 1 _6 = IMAGPART_EXPR <*_5>;
> >   load permutation { 0 1 }
> >
> > and later during optimize_slp we get:
> >
> > Transforming SLP expression from ADDSUB to COMPLEX_ADD_ROT270
> > processing node 0x494ebb0
> > simplifying permute node 0x494ebb0
> > Optimized SLP instances:
> > node 0x494e970 (max_nunits=4, refcnt=1)
> > op template: REALPART_EXPR <*_10> = _23;
> >   stmt 0 REALPART_EXPR <*_10> = _23;
> >   stmt 1 IMAGPART_EXPR <*_10> = _22;
> >   children 0x494ea00
> > node 0x494ea00 (max_nunits=4, refcnt=1)
> > op template: slp_patt_39 = .COMPLEX_ADD_ROT270 (_23, _23);
> >   stmt 0 _23 = _6 + _13;
> >   stmt 1 _22 = _12 - _8;
> >   children 0x494eb20 0x494ebb0
> > node 0x494eb20 (max_nunits=4, refcnt=1)
> > op template: _13 = REALPART_EXPR <*_3>;
> >   stmt 0 _13 = REALPART_EXPR <*_3>;
> >   stmt 1 _12 = IMAGPART_EXPR <*_3>;
> > node 0x494ebb0 (max_nunits=4, refcnt=1)
> > op: VEC_PERM_EXPR
> >   { }
> >   lane permutation { 0[0] 0[1] }
> >   children 0x494ec40
> > node 0x494ec40 (max_nunits=1, refcnt=2)
> > op template: _8 = REALPART_EXPR <*_5>;
> >   stmt 0 _8 = REALPART_EXPR <*_5>;
> >   stmt 1 _6 = IMAGPART_EXPR <*_5>;
> >
> > Now I still have to elide the VEC_PERM_EXPR here, but that's easy.
>
> So having skimmed half of the patch - this means SLP pattern recog will
> initially recognize { +, -, +, - } as ADDSUB for example but not factor
> in lane permutes from loads yet.  Now, suppose we have { +, -, -, + }
> seen in pattern recog - how's that handled?
These will be rejected; the lanes are still checked to ensure that it's a
{ +, -, +, - }.  The lane permutes are checked in unchanged code, namely
vect_detect_pair_op (slp_tree node, bool two_operands = true, ...), which
only calls ::matches when the operation is consistent.

> It might feed operations where we'd like to have inputs permuted and
> thus would end up with ADDSUB in the end?

In principle we could allow this.  If we were to do so, one way to handle
it would be to add a permute for ADDSUB as well to permute the lanes.  I
think it's better in this scenario to have the pattern recog code insert
the permute early on, in the ::build phase.  This maintains correctness
in the SLP tree and the extra permute will be dealt with in optimize_slp's
normal working.  It becomes a bit tricky here though, since such cases are
perhaps better handled by increasing the VF and using permuted loads
(LOAD_LANES).

> That said, you do
>
> +      /* Check to see if this node can be transformed during permute
> +         materialization.  */
> +      bool patt_trans_cand_p = is_pattern_convert_candidate_p (node);
> +      if (patt_trans_cand_p)
> +        bitmap_set_bit (n_transform, idx);
>
> in the propagation stage (but this looks like a static marking).

Yeah, indeed; it's just a way to say "inspect these later".

> And then just for each of those candidates you somehow "undo" the
> permutes on the SLP childs (by modifying it in-place - uh, see below).
> But if you figure one child that is not permuted you've done that anyway
> but stop.

Yeah, if the child is unpermuted, vect_reverse_perm returns the identity
permute, which is then later elided.  In the final version this should
just not set a permute at all.

> So what I've expected is that this "transform" is done during
> propagation (where you now set the n_transform bit), by means of
> detecting the candidate as materialization point (but I'd have to recap
> the kind of permutes we expect to be incoming).  That is, for a specific
> set of permutes on the successor edges claim the node can handle the
> permute and thus propagate unpermuted upwards.
>
> At the materialization stage you'd then transform the node.

Yeah, that makes sense.  I separated them out to have distinct "stages",
but yeah, the loops can be combined.  I'll do that.

> Btw, how do we handle targets that can do complex_addsub but not addsub?

That's why during pattern matching the optab code checks both.  ADDSUB is
temporarily accepted and then the transform code MUST transform it, so
ADDSUB cannot leave optimize_slp if the operation is not supported.  This
is what I meant with the four states (missing from this patch):

* not supported: Move on.
* Supports ADDSUB only: Rewrite using ADDSUB, set type to 'cannot_transform'
* Supports COMPLEX_ADD_ROT270 only: Rewrite using ADDSUB, set type to 'must_transform'
* Supports both: Rewrite using ADDSUB, set type to 'can_transform'
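
As a concrete illustration of the inverse-permute step that the transform
relies on: below is a minimal standalone sketch in plain C++ of inverting
a permutation, in the spirit of the patch's vect_reverse_perm but with
std::vector standing in for GCC's vec types and a made-up function name:

  #include <cstdio>
  #include <vector>

  /* Return the inverse of PERM under "select" semantics: a permute
     produces out[i] = in[perm[i]], so applying the inverse afterwards
     restores the original order (perm[inv[i]] == i).  An identity PERM
     yields the identity, matching the "returns the identity permute"
     behaviour noted above.  */
  static std::vector<unsigned>
  inverse_perm (const std::vector<unsigned> &perm)
  {
    std::vector<unsigned> inv (perm.size ());
    for (unsigned i = 0; i < perm.size (); ++i)
      inv[perm[i]] = i;
    return inv;
  }

  int
  main ()
  {
    /* { 2, 0, 1 } sends lane 0 to element 2, lane 1 to element 0, ...  */
    for (unsigned v : inverse_perm ({ 2, 0, 1 }))
      printf ("%u ", v);   /* prints "1 2 0".  */
    printf ("\n");
    return 0;
  }
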
> > This works for ADD, but it doesn't work well when there's a
> > complicated sequence of loads.  So for example
> >
> > #include <complex.h>
> > #define N 200
> > void g (double complex a[restrict N], double complex b[restrict N],
> >         double complex c[restrict N])
> > {
> >   for (int i=0; i < N; i++)
> >     {
> >       c[i] = a[i] * b[i];
> >     }
> > }
> >
> > will result in an SLP tree where each operand of the multiply does
> > not get to see all the original vectors:
> >
> > Final SLP tree for instance 0x5678a30:
> > node 0x55965a0 (max_nunits=4, refcnt=2)
> > op template: REALPART_EXPR <*_7> = _25;
> >   stmt 0 REALPART_EXPR <*_7> = _25;
> >   stmt 1 IMAGPART_EXPR <*_7> = _26;
> >   children 0x5596630
> > node 0x5596630 (max_nunits=4, refcnt=2)
> > op: VEC_PERM_EXPR
> >   stmt 0 _25 = _17 - _22;
> >   stmt 1 _26 = _23 + _24;
> >   lane permutation { 0[0] 1[1] }
> >   children 0x5596a20 0x5596ab0
> > node 0x5596a20 (max_nunits=1, refcnt=1)
> > op template: _25 = _17 - _22;
> >   { }
> >   children 0x55966c0 0x5596870
> > node 0x55966c0 (max_nunits=4, refcnt=3)
> > op template: _17 = _10 * _19;
> >   stmt 0 _17 = _10 * _19;
> >   stmt 1 _23 = _10 * _18;
> >   children 0x5596750 0x55967e0
> > node 0x5596750 (max_nunits=4, refcnt=2)
> > op template: _10 = REALPART_EXPR <*_3>;
> >   stmt 0 _10 = REALPART_EXPR <*_3>;
> >   stmt 1 _10 = REALPART_EXPR <*_3>;
> >   load permutation { 0 0 }
> > node 0x55967e0 (max_nunits=4, refcnt=2)
> > op template: _19 = REALPART_EXPR <*_5>;
> >   stmt 0 _19 = REALPART_EXPR <*_5>;
> >   stmt 1 _18 = IMAGPART_EXPR <*_5>;
> >   load permutation { 0 1 }
> > node 0x5596870 (max_nunits=4, refcnt=3)
> > op template: _22 = _9 * _18;
> >   stmt 0 _22 = _9 * _18;
> >   stmt 1 _24 = _9 * _19;
> >   children 0x5596900 0x5596990
> > node 0x5596900 (max_nunits=4, refcnt=2)
> > op template: _9 = IMAGPART_EXPR <*_3>;
> >   stmt 0 _9 = IMAGPART_EXPR <*_3>;
> >   stmt 1 _9 = IMAGPART_EXPR <*_3>;
> >   load permutation { 1 1 }
> > node 0x5596990 (max_nunits=4, refcnt=2)
> > op template: _18 = IMAGPART_EXPR <*_5>;
> >   stmt 0 _18 = IMAGPART_EXPR <*_5>;
> >   stmt 1 _19 = REALPART_EXPR <*_5>;
> >   load permutation { 1 0 }
> > node 0x5596ab0 (max_nunits=1, refcnt=1)
> > op template: _26 = _23 + _24;
> >   { }
> >   children 0x55966c0 0x5596870
> >
> > So depending on the node each multiply either sees REAL 3 + REAL 5 +
> > IMAG 5 or IMAG 3 + IMAG 5 + REAL 5.
> >
> > Since we are removing the TWO_OPERANDS node we need to drop one of the
> > multiplies, and so we need to give the remaining one the original 2
> > vectors as parameters.  The current implementation emits a permute
> > operation to combine the loads again and later pushes the permute
> > down.  This is problematic as it again requires us to do dataflow
> > analysis early.
> >
> > To counter this, in the patch I have changed loads to no longer come
> > out of build_slp with LOAD_PERMUTES but instead to have a VEC_PERM
> > above each load.
>
> Yep.  I've been there before (not sure if I ever sent you my
> work-in-progress here).  There are some things wrong in your patch, but
> I guess doing this exactly for the permutes optimize_slp handles would
> be fine.

Yeah, it fell apart when testing the x264 code, so I noticed I had some of
the lanes wrong, but I didn't fix it yet as it wasn't needed for my
testcases.

> We should see to do this independently of the stuff above, I can handle
> this and will prepare a patch for this later.

Thanks! That's a big help!

> > This is only temporary and the idea is that optimize SLP will push
> > the permutes back down into the loads once it's deemed there are still
> > permutes needed.
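
To illustrate that split, here is a minimal standalone sketch (plain C++;
std::vector and std::pair stand in for GCC's vec and lane_permutation_t
types, so this is illustrative only) of turning a permuted load into a
linear load plus an explicit lane permute, in the spirit of the
vect_build_slp_tree_2 hunk in the patch:

  #include <cstdio>
  #include <utility>
  #include <vector>

  int
  main ()
  {
    /* The scalar lanes requested from the group, e.g. { imag, real }
       of *_5, i.e. the old load permutation { 1 0 }.  */
    std::vector<unsigned> old_load_perm = { 1, 0 };

    /* The load node itself becomes linear: lane j loads element j.  */
    std::vector<unsigned> linear_load;
    for (unsigned j = 0; j < old_load_perm.size (); ++j)
      linear_load.push_back (j);

    /* The permutation moves into a VEC_PERM node above the load; an
       entry (op, lane) selects LANE of child OP, here always child 0.  */
    std::vector<std::pair<unsigned, unsigned>> lane_perm;
    for (unsigned place : old_load_perm)
      lane_perm.push_back (std::make_pair (0u, place));

    for (auto &p : lane_perm)
      printf ("%u[%u] ", p.first, p.second);  /* prints "0[1] 0[0]".  */
    printf ("\n");
    return 0;
  }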
> >
> > So COMPLEX MUL becomes:
> >
> > Final SLP tree for instance 0x46d9c80:
> > node 0x45f76c0 (max_nunits=4, refcnt=2)
> > op template: REALPART_EXPR <*_7> = _25;
> >   stmt 0 REALPART_EXPR <*_7> = _25;
> >   stmt 1 IMAGPART_EXPR <*_7> = _26;
> >   children 0x45f7750
> > node 0x45f7750 (max_nunits=4, refcnt=2)
> > op: VEC_PERM_EXPR
> >   stmt 0 _25 = _17 - _22;
> >   stmt 1 _26 = _23 + _24;
> >   lane permutation { 0[0] 1[1] }
> >   children 0x45f7bd0 0x45f7c60
> > node 0x45f7bd0 (max_nunits=1, refcnt=1)
> > op template: _25 = _17 - _22;
> >   { }
> >   children 0x45f77e0 0x45f7a20
> > node 0x45f77e0 (max_nunits=4, refcnt=3)
> > op template: _17 = _10 * _19;
> >   stmt 0 _17 = _10 * _19;
> >   stmt 1 _23 = _10 * _18;
> >   children 0x45f7870 0x45f7990
> > node 0x45f7870 (max_nunits=4, refcnt=2)
> > op: VEC_PERM_EXPR
> >   { }
> >   lane permutation { 0[0] 0[0] }
> >   children 0x45f7900
> > node 0x45f7900 (max_nunits=1, refcnt=4)
> > op template: _10 = REALPART_EXPR <*_3>;
> >   stmt 0 _10 = REALPART_EXPR <*_3>;
> >   stmt 1 _9 = IMAGPART_EXPR <*_3>;
> >   load permutation { 0 1 }
> > node 0x45f7990 (max_nunits=4, refcnt=3)
> > op template: _19 = REALPART_EXPR <*_5>;
> >   stmt 0 _19 = REALPART_EXPR <*_5>;
> >   stmt 1 _18 = IMAGPART_EXPR <*_5>;
> >   load permutation { 0 1 }
> > node 0x45f7a20 (max_nunits=4, refcnt=3)
> > op template: _22 = _9 * _18;
> >   stmt 0 _22 = _9 * _18;
> >   stmt 1 _24 = _9 * _19;
> >   children 0x45f7ab0 0x45f7b40
> > node 0x45f7ab0 (max_nunits=4, refcnt=2)
> > op: VEC_PERM_EXPR
> >   { }
> >   lane permutation { 0[1] 0[1] }
> >   children 0x45f7900
> > node 0x45f7b40 (max_nunits=4, refcnt=2)
> > op: VEC_PERM_EXPR
> >   { }
> >   lane permutation { 0[1] 0[0] }
> >   children 0x45f7990
> > node 0x45f7c60 (max_nunits=1, refcnt=1)
> > op template: _26 = _23 + _24;
> >   { }
> >   children 0x45f77e0 0x45f7a20
> >
> > This representation has a number of advantages:
> >
> > * complex mul becomes trivial, as trivial as add.
> > * add now detects itself as part of a sequence of complex mul, i.e. if
> >   you have only the add, the pattern will use it instead of failing to
> >   follow the dataflow as it does now.
> > * It opens the door a bit more to vectorizing the first loop in x264.
> >   The code there has a cross-iteration mix of loading the full vector,
> >   but in an order that the vectorizer thinks it can't use a single
> >   linear load for, e.g.
> >
> >   .. = (a[0] + b[0]) - (a[4] + b[4])
> >   .. = (a[1] + b[1]) - (a[5] + b[5])
> >   .. = (a[2] + b[2]) - (a[6] + b[6])
> >   .. = (a[3] + b[3]) - (a[7] + b[7])
> >
> >   Ultimately it thinks it's cheaper to use scalar loads here, while (at
> >   least for AArch64) if we see the unrolled loop together we can
> >   pattern match this sequence and rewrite it to something that uses the
> >   linear load.  This is because the addition and subtraction are
> >   widening, and the widening arithmetic operations we already have only
> >   read half the vector.
> >
> >   For this to work we need to see all 16 bytes worth of loads, so the
> >   full unrolled loop of 4 iterations, which currently doesn't happen
> >   for NEON but does for SVE.  (Though I wonder, now that you've added
> >   SLP decomposition, whether the initial SLP build shouldn't allow
> >   building SLP trees that are wider than the vector length and later
> >   just decompose them or bump up the VF.)  However, this is something
> >   for another discussion :).
> >
> > Changing the loads like this makes things a lot simpler and we
> > require no additional work anywhere else.
> > The complexity ends up in optimize_slp, which now has the following
> > changes:
> >
> > * Loads no longer introduce permutes, so they don't see the
> >   permutations.  Instead each load can feed a permute node, which does
> >   see the permutes.
> > * As we're moving permutes up, we stop at call boundaries as well.
> > * While calculating materialization points for the permutes we inspect
> >   each node and check whether we can transform it; if so we mark it
> >   for transformation.
> > * After calculating the materialization and transform points we
> >   iterate over the transform points checking whether we should
> >   transform.  If we should, we modify the permute at the
> >   materialization point (which should be at the transform point) into
> >   the inverse permute.
> > * During materialization this inverted permute then undoes the
> >   original permutation if we transformed the instruction.
> > * After computing this we elide the permute node, or push it down
> >   towards the load and then back into the load itself, splitting the
> >   node if required.
> >
> > After this the tree should be back in its normal form.  I still need
> > to work out when and where one needs to push the permute all the way
> > back down into the loads.
>
> It's indeed the case that the current permute "propagation" is away from
> loads only, and a phase to propagate in the reverse direction is missing
> if we now have permute sources other than loads.  The cycle handling of
> the propagation is also a bit awkward and incomplete (but it "works").
>

The thing I'm also wondering about is whether it's always better to push
the permute back into the load: if from a given load you have different
permuted variants of the result, it can be beneficial to sometimes keep
the permutes and do the load only once.

Thanks,
Tamar

> Richard.

> > But this covers my thoughts.
> > Any feedback is appreciated to get it right early on :)
> >
> > Regards,
> > Tamar
> >
> > --- inline copy of patch --
> > diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> > index b2f414d2131b867eda337cd30f5ed40ed7c9fa10..73d1d742414d16e06e8312967990d565eddb218d 100644
> > --- a/gcc/internal-fn.def
> > +++ b/gcc/internal-fn.def
> > @@ -277,6 +277,10 @@ DEF_INTERNAL_FLT_FN (SCALB, ECF_CONST, scalb, binary)
> >  DEF_INTERNAL_FLT_FLOATN_FN (FMIN, ECF_CONST, fmin, binary)
> >  DEF_INTERNAL_FLT_FLOATN_FN (FMAX, ECF_CONST, fmax, binary)
> >  DEF_INTERNAL_OPTAB_FN (XORSIGN, ECF_CONST, xorsign, binary)
> > +DEF_INTERNAL_OPTAB_FN (ADDSUB, ECF_CONST, addsub, binary)
> > +DEF_INTERNAL_OPTAB_FN (SUBADD, ECF_CONST, subadd, binary)
> > +DEF_INTERNAL_OPTAB_FN (MUL_ADDSUB, ECF_CONST, mul_addsub, binary)
> > +DEF_INTERNAL_OPTAB_FN (MUL_SUBADD, ECF_CONST, mul_subadd, binary)
> >  DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90, ECF_CONST, cadd90, binary)
> >  DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
> >  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
> > diff --git a/gcc/optabs.def b/gcc/optabs.def
> > index b192a9d070b8aa72e5676b2eaa020b5bdd7ffcc8..73d23c49bdfa7423f67a6240650c558cfaf3f2f0 100644
> > --- a/gcc/optabs.def
> > +++ b/gcc/optabs.def
> > @@ -290,6 +290,10 @@ OPTAB_D (atan_optab, "atan$a2")
> >  OPTAB_D (atanh_optab, "atanh$a2")
> >  OPTAB_D (copysign_optab, "copysign$F$a3")
> >  OPTAB_D (xorsign_optab, "xorsign$F$a3")
> > +OPTAB_D (addsub_optab, "addsub$a3")
> > +OPTAB_D (subadd_optab, "subadd$a3")
> > +OPTAB_D (mul_addsub_optab, "mul_addsub$a3")
> > +OPTAB_D (mul_subadd_optab, "mul_subadd$a3")
> >  OPTAB_D (cadd90_optab, "cadd90$a3")
> >  OPTAB_D (cadd270_optab, "cadd270$a3")
> >  OPTAB_D (cmul_optab, "cmul$a3")
> > diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
> > index 2ed49cd9edcabd7948b365dd60d7405b79079a7b..ce8d2daa1f06f17eda2cb08255d76bac9cf263f0 100644
> > --- a/gcc/tree-vect-slp-patterns.c
> > +++ b/gcc/tree-vect-slp-patterns.c
> > @@ -68,6 +68,39 @@ along with GCC; see the file COPYING3.  If not see
> >   * vect_pattern class
> >   ******************************************************************************/
> >
> > +static bool
> > +vect_pattern_validate_optab (internal_fn ifn, slp_tree node);
> > +
> > +
> > +/* Check if the base or complex equivalents are supported.  */
> > +static bool
> > +vect_pattern_validate_optab_or_eq (internal_fn ifn, slp_tree node)
> > +{
> > +  if (vect_pattern_validate_optab (ifn, node))
> > +    return true;
> > +
> > +  switch (ifn)
> > +    {
> > +    case IFN_ADDSUB:
> > +      ifn = IFN_COMPLEX_ADD_ROT270;
> > +      break;
> > +    case IFN_SUBADD:
> > +      ifn = IFN_COMPLEX_ADD_ROT90;
> > +      break;
> > +    case IFN_MUL_ADDSUB:
> > +      ifn = IFN_COMPLEX_MUL_CONJ;
> > +      break;
> > +    case IFN_MUL_SUBADD:
> > +      ifn = IFN_COMPLEX_MUL;
> > +      break;
> > +    default:
> > +      gcc_unreachable ();
> > +    }
> > +
> > +  return vect_pattern_validate_optab (ifn, node);
> > +}
> > +
> > +
> >  /* Default implementation of recognize that performs matching, validation and
> >     replacement of nodes but that can be overriden if required.  */
> >
> > @@ -630,8 +663,7 @@ complex_add_pattern::build (vec_info *vinfo)
> >
> >    /* First re-arrange the children.  */
> >    SLP_TREE_CHILDREN (*this->m_node)[0] = children[0];
> > -  SLP_TREE_CHILDREN (*this->m_node)[1] =
> > -    vect_build_swap_evenodd_node (children[1]);
> > +  SLP_TREE_CHILDREN (*this->m_node)[1] = children[1];
> >
> >    SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (*this->m_node)[0])++;
> >    SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (*this->m_node)[1])++;
> > @@ -670,9 +702,9 @@ complex_add_pattern::matches (complex_operation_t op,
> >       Rotation 0 and 180 can be handled by normal SIMD code, so we don't need
> >       to care about them here.  */
> >    if (op == MINUS_PLUS)
> > -    ifn = IFN_COMPLEX_ADD_ROT90;
> > +    ifn = IFN_SUBADD;
> >    else if (op == PLUS_MINUS)
> > -    ifn = IFN_COMPLEX_ADD_ROT270;
> > +    ifn = IFN_ADDSUB;
> >    else
> >      return ifn;
> >
> > @@ -680,17 +712,7 @@ complex_add_pattern::matches (complex_operation_t op,
> >       we support.  */
> >    gcc_assert (ops->length () == 2);
> >
> > -  vec<slp_tree> children = SLP_TREE_CHILDREN ((*ops)[0]);
> > -
> > -  /* First node must be unpermuted.  */
> > -  if (linear_loads_p (perm_cache, children[0]) != PERM_EVENODD)
> > -    return IFN_LAST;
> > -
> > -  /* Second node must be permuted.  */
> > -  if (linear_loads_p (perm_cache, children[1]) != PERM_ODDEVEN)
> > -    return IFN_LAST;
> > -
> > -  if (!vect_pattern_validate_optab (ifn, *node))
> > +  if (!vect_pattern_validate_optab_or_eq (ifn, *node))
> >      return IFN_LAST;
> >
> >    return ifn;
> > @@ -1501,7 +1523,8 @@ vect_pattern_decl_t slp_patterns[]
> >       order patterns from the largest to the smallest.  Especially if they
> >       overlap in what they can detect.  */
> >
> > -  SLP_PATTERN (complex_operations_pattern),
> > +  SLP_PATTERN (complex_add_pattern),
> > +  //SLP_PATTERN (complex_operations_pattern),
> >  };
> >  #undef SLP_PATTERN
> >
> > diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> > index 0c1f85beeb2e9f3fb7c66c15d4d30594b2570f9e..f24ad85d15780807c49153895b7340c9b4d4d1f3 100644
> > --- a/gcc/tree-vect-slp.c
> > +++ b/gcc/tree-vect-slp.c
> > @@ -1764,25 +1764,69 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
> >  	{
> >  	  *max_nunits = this_max_nunits;
> >  	  (*tree_size)++;
> > -	  node = vect_create_new_slp_node (node, stmts, 0);
> > -	  SLP_TREE_VECTYPE (node) = vectype;
> > +	  node = vect_create_new_slp_node (node, stmts, 1);
> >  	  /* And compute the load permutation.  Whether it is actually
> >  	     a permutation depends on the unrolling factor which is
> > -	     decided later.  */
> > -	  vec<unsigned> load_permutation;
> > +	     decided later.  We specifically materialize permutes at this
> > +	     early stage and leave it up to optimize SLP to push them back
> > +	     into the load if needed.  */
> > +	  lane_permutation_t lane_permutation;
> > +
> > +	  /* Seed the loads with a linear permute, so that optimize_slp
> > +	     can later push the permute upwards and downwards without
> > +	     needing to inspect the parent node.  */
> > +	  load_permutation_t load_permutation;
> >  	  int j;
> > -	  stmt_vec_info load_info;
> > +	  stmt_vec_info load_info, curr_load;
> > +	  lane_permutation.create (group_size);
> >  	  load_permutation.create (group_size);
> > +	  vec<stmt_vec_info> group_stmts;
> > +	  bool linear = true;
> >  	  stmt_vec_info first_stmt_info
> >  	    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
> > +	  group_stmts.create (DR_GROUP_SIZE (first_stmt_info));
> > +	  curr_load = first_stmt_info;
> >  	  FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
> >  	    {
> >  	      int load_place = vect_get_place_in_interleaving_chain
> >  		  (load_info, first_stmt_info);
> >  	      gcc_assert (load_place != -1);
> > -	      load_permutation.safe_push (load_place);
> > +
> > +	      lane_permutation.safe_push (std::make_pair (0, load_place));
> > +	      load_permutation.quick_push (j);
> > +
> > +	      group_stmts.quick_push (curr_load);
> > +	      linear = linear && curr_load == load_info;
> > +	      curr_load = DR_GROUP_NEXT_ELEMENT (curr_load);
> > +	    }
> > +
> > +	  if (!linear)
> > +	    {
> > +	      slp_tree child;
> > +	      if (slp_tree *load = bst_map->get (group_stmts))
> > +		child = *load;
> > +	      else
> > +		{
> > +		  child = vect_create_new_slp_node (group_stmts.copy (), 0);
> > +		  SLP_TREE_VECTYPE (child) = vectype;
> > +		  /* One for me, one for the BST map.  */
> > +		  SLP_TREE_REF_COUNT (child)++;
> > +		  bst_map->put (group_stmts, child);
> > +		}
> > +
> > +	      /* Set some metadata on the load node.  */
> > +	      SLP_TREE_REF_COUNT (child)++;
> > +	      SLP_TREE_LANES (child) = SLP_TREE_LANES (node);
> > +	      SLP_TREE_LOAD_PERMUTATION (child) = load_permutation;
> > +
> > +	      /* But return a permute node.  */
> > +	      SLP_TREE_LANE_PERMUTATION (node) = lane_permutation;
> > +	      SLP_TREE_CODE (node) = VEC_PERM_EXPR;
> > +	      SLP_TREE_CHILDREN (node).quick_push (child);
> > +	      SLP_TREE_SCALAR_STMTS (node) = vNULL;
> >  	    }
> > -	  SLP_TREE_LOAD_PERMUTATION (node) = load_permutation;
> > +
> > +	  SLP_TREE_VECTYPE (node) = vectype;
> >  	  return node;
> >  	}
> >      }
> > @@ -3430,8 +3474,8 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
> >    FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (vinfo), i, instance)
> >      {
> >        slp_tree root = SLP_INSTANCE_TREE (instance);
> > -      optimize_load_redistribution (bst_map, vinfo, SLP_TREE_LANES (root),
> > -				    &load_map, root);
> > +      //optimize_load_redistribution (bst_map, vinfo, SLP_TREE_LANES (root),
> > +      //			      &load_map, root);
> >      }
> >  }
> >
> > @@ -3548,6 +3592,83 @@ vect_slp_perms_eq (const vec<vec<unsigned> > &perms,
> >  		   sizeof (unsigned) * perms[perm_a].length ()) == 0));
> >  }
> >
> > +static bool
> > +is_pattern_convert_candidate_p (slp_tree node)
> > +{
> > +  stmt_vec_info stmt_info;
> > +  if (!(stmt_info = SLP_TREE_REPRESENTATIVE (node))
> > +      || !STMT_VINFO_SLP_VECT_ONLY_PATTERN (stmt_info))
> > +    return false;
> > +
> > +  gimple *stmt = STMT_VINFO_STMT (stmt_info);
> > +  if (!is_gimple_call (stmt))
> > +    return false;
> > +
> > +  if (!gimple_call_internal_p (stmt))
> > +    return false;
> > +
> > +  internal_fn ifn = gimple_call_internal_fn (stmt);
> > +
> > +  switch (ifn)
> > +    {
> > +    case IFN_ADDSUB:
> > +    case IFN_SUBADD:
> > +    case IFN_MUL_ADDSUB:
> > +    case IFN_MUL_SUBADD:
> > +      return true;
> > +    default:
> > +      break;
> > +    }
> > +
> > +  return false;
> > +}
> > +
> > +static internal_fn
> > +vect_get_transformed_pattern (internal_fn ifn, slp_tree /* node */)
> > +{
> > +  switch (ifn)
> > +    {
> > +    case IFN_ADDSUB:
> > +      ifn = IFN_COMPLEX_ADD_ROT270;
> > +      break;
> > +    case IFN_SUBADD:
> > +      ifn = IFN_COMPLEX_ADD_ROT90;
> > +      break;
> > +    case IFN_MUL_ADDSUB:
> > +      ifn = IFN_COMPLEX_MUL_CONJ;
> > +      break;
> > +    case IFN_MUL_SUBADD:
> > +      ifn = IFN_COMPLEX_MUL;
> > +      break;
> > +    default:
> > +      gcc_unreachable ();
> > +    }
> > +
> > +  return ifn;
> > +}
> > +
> > +
> > +static load_permutation_t
> > +vect_reverse_perm (load_permutation_t perm)
> > +{
> > +  auto_vec<std::pair<unsigned, unsigned> > pairs;
> > +  pairs.create (perm.length ());
> > +  unsigned i;
> > +  for (i = 0; i < perm.length (); i++)
> > +    pairs.quick_push (std::make_pair (i, perm[i]));
> > +
> > +  typedef const std::pair<unsigned, unsigned>* cmp_t;
> > +  pairs.qsort ([](const void* a, const void* b) -> int
> > +	       { return (int)((cmp_t)a)->second - (int)((cmp_t)b)->second; });
> > +
> > +  load_permutation_t rev_perm;
> > +  rev_perm.create (perm.length ());
> > +  for (i = 0; i < perm.length (); i++)
> > +    rev_perm.quick_push (pairs[i].first);
> > +
> > +  return rev_perm;
> > +}
> > +
> >  /* Optimize the SLP graph of VINFO.  */
> >
> >  void
> > @@ -3578,11 +3699,13 @@ vect_optimize_slp (vec_info *vinfo)
> >
> >    auto_sbitmap n_visited (vertices.length ());
> >    auto_sbitmap n_materialize (vertices.length ());
> > +  auto_sbitmap n_transform (vertices.length ());
> >    auto_vec<int> n_perm (vertices.length ());
> >    auto_vec<vec<unsigned> > perms;
> >
> >    bitmap_clear (n_visited);
> >    bitmap_clear (n_materialize);
> > +  bitmap_clear (n_transform);
> >    n_perm.quick_grow_cleared (vertices.length ());
> >    perms.safe_push (vNULL); /* zero is no permute */
> >
> > @@ -3602,55 +3725,73 @@ vect_optimize_slp (vec_info *vinfo)
> >  	 as entries to the reverse graph.  */
> >        if (!slpg->vertices[idx].succ)
> >  	bitmap_set_bit (n_visited, idx);
> > -      /* Loads are the only thing generating permutes.  */
> > -      if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
> > -	continue;
> >
> > -      /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> > -	 node unpermuted, record this permute.  */
> > -      stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
> > -      if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> > -	continue;
> > -      dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> > -      unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> > -      bool any_permute = false;
> > -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> > +      int perm_seed = 0;
> > +
> > +      for (graph_edge *pred = slpg->vertices[idx].pred;
> > +	   pred; pred = pred->pred_next)
> >  	{
> > -	  unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
> > -	  imin = MIN (imin, idx);
> > -	  imax = MAX (imax, idx);
> > -	  if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
> > -	    any_permute = true;
> > -	}
> > -      /* If there's no permute no need to split one out.  */
> > -      if (!any_permute)
> > -	continue;
> > -      /* If the span doesn't match we'd disrupt VF computation, avoid
> > -	 that for now.  */
> > -      if (imax - imin + 1 != SLP_TREE_LANES (node))
> > -	continue;
> > +	  int src_idx = pred->src;
> > +	  slp_tree src_node = vertices[src_idx];
> >
> > -      /* For now only handle true permutes, like
> > -	 vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> > -	 when permuting constants and invariants keeping the permute
> > -	 bijective.  */
> > -      auto_sbitmap load_index (SLP_TREE_LANES (node));
> > -      bitmap_clear (load_index);
> > -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> > -	bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
> > -      unsigned j;
> > -      for (j = 0; j < SLP_TREE_LANES (node); ++j)
> > -	if (!bitmap_bit_p (load_index, j))
> > -	  break;
> > -      if (j != SLP_TREE_LANES (node))
> > -	continue;
> > +
> > +	  /* Loads are the only thing generating permutes, however the permutes
> > +	     are not yet lowered into the loads.  So instead chase up the chain
> > +	     to find a permute node.  */
> > +	  if (!SLP_TREE_LANE_PERMUTATION (src_node).exists ())
> > +	    continue;
> > +
> > +	  /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> > +	     node unpermuted, record this permute.  */
> > +	  stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
> > +	  if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> > +	    continue;
> > +	  dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> > +	  unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> > +	  bool any_permute = false;
> > +	  for (unsigned j = 0; j < SLP_TREE_LANES (src_node); ++j)
> > +	    {
> > +	      unsigned src_op = SLP_TREE_LANE_PERMUTATION (src_node)[j].first;
> > +	      unsigned idx = SLP_TREE_LANE_PERMUTATION (src_node)[j].second;
> > +	      /* This isn't a load permute moved out.  */
> > +	      if (src_op != 0)
> > +		continue;
> > +	      imin = MIN (imin, idx);
> > +	      imax = MAX (imax, idx);
> > +	      if (idx - SLP_TREE_LANE_PERMUTATION (src_node)[0].second != j)
> > +		any_permute = true;
> > +	    }
> > +	  /* If there's no permute no need to split one out.  */
> > +	  if (!any_permute)
> > +	    continue;
> > +	  /* If the span doesn't match we'd disrupt VF computation, avoid
> > +	     that for now.  */
> > +	  if (imax - imin + 1 != SLP_TREE_LANES (node))
> > +	    continue;
> > +
> > +	  /* For now only handle true permutes, like
> > +	     vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> > +	     when permuting constants and invariants keeping the permute
> > +	     bijective.  */
> > +	  auto_sbitmap load_index (SLP_TREE_LANES (src_node));
> > +	  bitmap_clear (load_index);
> > +	  for (unsigned j = 0; j < SLP_TREE_LANES (src_node); ++j)
> > +	    bitmap_set_bit (load_index, SLP_TREE_LANE_PERMUTATION (src_node)[j].second - imin);
> > +	  unsigned j;
> > +	  for (j = 0; j < SLP_TREE_LANES (src_node); ++j)
> > +	    if (!bitmap_bit_p (load_index, j))
> > +	      break;
> > +	  if (j != SLP_TREE_LANES (src_node))
> > +	    continue;
> >
> > -      vec<unsigned> perm = vNULL;
> > -      perm.safe_grow (SLP_TREE_LANES (node), true);
> > -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> > -	perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> > -      perms.safe_push (perm);
> > -      n_perm[idx] = perms.length () - 1;
> > +	  vec<unsigned> perm = vNULL;
> > +	  perm.safe_grow (SLP_TREE_LANES (src_node), true);
> > +	  for (unsigned j = 0; j < SLP_TREE_LANES (src_node); ++j)
> > +	    perm[j] = SLP_TREE_LANE_PERMUTATION (src_node)[j].second - imin;
> > +	  perms.safe_push (perm);
> > +	  n_perm[src_idx] = perms.length () - 1;
> > +	  perm_seed = -1;
> > +	}
> > +      n_perm[idx] = perm_seed;
> >      }
> >
> >    /* Propagate permutes along the graph and compute materialization
> >       points.  */
> > @@ -3667,12 +3808,12 @@ vect_optimize_slp (vec_info *vinfo)
> >  	  slp_tree node = vertices[idx];
> >  	  /* For leafs there's nothing to do - we've seeded permutes
> >  	     on those above.  */
> > -	  if (SLP_TREE_DEF_TYPE (node) != vect_internal_def)
> > +	  if (SLP_TREE_DEF_TYPE (node) != vect_internal_def ||
> > +	      !slpg->vertices[idx].succ)
> >  	    continue;
> >
> >  	  bitmap_set_bit (n_visited, idx);
> >
> > -	  /* We cannot move a permute across a store.  */
> > +	  /* We cannot move a permute across a store.  TODO: or a call.  */
> >  	  if (STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (node))
> >  	      && DR_IS_WRITE
> >  		   (STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (node))))
> > @@ -3723,6 +3864,12 @@ vect_optimize_slp (vec_info *vinfo)
> >  	      changed = true;
> >  	    }
> >
> > +	  /* Check to see if this node can be transformed during permute
> > +	     materialization.  */
> > +	  bool patt_trans_cand_p = is_pattern_convert_candidate_p (node);
> > +	  if (patt_trans_cand_p)
> > +	    bitmap_set_bit (n_transform, idx);
> > +
> >  	  if (perm == 0)
> >  	    continue;
> >
> > @@ -3764,6 +3911,46 @@ vect_optimize_slp (vec_info *vinfo)
> >      }
> >    while (changed || iteration == 1);
> >
> > +  bitmap_clear (n_visited);
> > +
> > +  /* Transform and record any permutes for the usage of the instruction.
> > +     TODO: Check if it's cheaper to keep ADDSUB if available or to use
> > +     COMPLEX_ADD based on the reverse permute that's needed.  */
> > +  sbitmap_iterator bi;
> > +  EXECUTE_IF_SET_IN_BITMAP (n_transform, 0, i, bi)
> > +    {
> > +      slp_tree node = vertices[i];
> > +
> > +      for (graph_edge *succ = slpg->vertices[i].succ;
> > +	   succ; succ = succ->succ_next)
> > +	{
> > +	  int perm_id = n_perm[succ->dest];
> > +	  if (perm_id <= 0)
> > +	    continue;
> > +
> > +	  load_permutation_t linperm = vect_reverse_perm (perms[perm_id]);
> > +	  perms[perm_id].release ();
> > +	  perms[perm_id] = linperm;
> > +
> > +	  bitmap_set_bit (n_materialize, succ->dest);
> > +	}
> > +
> > +      /* Transform the node if we have determined that it's beneficial to
> > +	 do so.  When we do so the perm that has been calculated for the
> > +	 children of the node will be applied.  */
> > +      stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
> > +      gimple *stmt = STMT_VINFO_STMT (stmt_info);
> > +      internal_fn old_ifn = gimple_call_internal_fn (stmt);
> > +      internal_fn ifn = vect_get_transformed_pattern (old_ifn, node);
> > +
> > +      if (dump_enabled_p ())
> > +	dump_printf_loc (MSG_NOTE, vect_location,
> > +			 "Transforming SLP expression from %s to %s\n",
> > +			 internal_fn_name (old_ifn), internal_fn_name (ifn));
> > +      gimple_call_set_internal_fn (as_a <gcall *> (stmt), ifn);
> > +    }
> > +
> > +
> >    /* Materialize.  */
> >    for (i = 0; i < vertices.length (); ++i)
> >      {
> > @@ -3798,6 +3985,11 @@ vect_optimize_slp (vec_info *vinfo)
> >  					  SLP_TREE_SCALAR_OPS (child), true);
> >  	}
> >
> > +      if (dump_enabled_p ())
> > +	dump_printf_loc (MSG_NOTE, vect_location,
> > +			 "processing node %p\n", node);
> > +
> >        if (bitmap_bit_p (n_materialize, i))
> >  	{
> >  	  if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> > @@ -3986,6 +4178,16 @@ vect_optimize_slp (vec_info *vinfo)
> >  	    }
> >  	}
> >      }
> > +
> > +  if (dump_enabled_p ())
> > +    {
> > +      dump_printf_loc (MSG_NOTE, vect_location, "Optimized SLP instances:\n");
> > +      hash_set<slp_tree> visited;
> > +      slp_instance instance;
> > +      FOR_EACH_VEC_ELT (vinfo->slp_instances, i, instance)
> > +	vect_print_slp_graph (MSG_NOTE, vect_location,
> > +			      SLP_INSTANCE_TREE (instance), visited);
> > +    }
> >  }
> >
> >  /* Gather loads reachable from the individual SLP graph entries.  */
*/ > > > > > > > > -- > Richard Biener <rguent...@suse.de> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 > Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)