Hi Richi, > -----Original Message----- > From: rguent...@c653.arch.suse.de <rguent...@c653.arch.suse.de> On > Behalf Of Richard Biener > Sent: Wednesday, November 4, 2020 1:36 PM > To: Tamar Christina <tamar.christ...@arm.com> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; o...@ucw.cz > Subject: Re: [PATCH v2 10/18]middle-end simplify lane permutes which > selects from loads from the same DR. > > On Tue, 3 Nov 2020, Tamar Christina wrote: > > > Hi All, > > > > This change allows one to simplify lane permutes that select from > > multiple load leafs that load from the same DR group by promoting the > > VEC_PERM node into a load itself and pushing the lane permute into it as a > load permute. > > > > This saves us from having to calculate where to materialize a new load node. > > If the resulting loads are now unused they are freed and are removed > > from the graph. > > > > This allows us to handle cases where we would have generated: > > > > movi v4.4s, 0 > > adrp x3, .LC0 > > ldr q5, [x3, #:lo12:.LC0] > > mov x3, 0 > > .p2align 3,,7 > > .L2: > > mov v0.16b, v4.16b > > mov v3.16b, v4.16b > > ldr q1, [x1, x3] > > ldr q2, [x0, x3] > > fcmla v0.4s, v2.4s, v1.4s, #0 > > fcmla v3.4s, v1.4s, v2.4s, #0 > > fcmla v0.4s, v2.4s, v1.4s, #270 > > fcmla v3.4s, v1.4s, v2.4s, #270 > > mov v1.16b, v3.16b > > tbl v0.16b, {v0.16b - v1.16b}, v5.16b > > str q0, [x2, x3] > > add x3, x3, 16 > > cmp x3, 1600 > > bne .L2 > > ret > > > > and instead generate > > > > mov x3, 0 > > .p2align 3,,7 > > .L27: > > ldr q0, [x2, x3] > > ldr q1, [x0, x3] > > ldr q2, [x1, x3] > > fcmla v0.2d, v1.2d, v2.2d, #0 > > fcmla v0.2d, v1.2d, v2.2d, #270 > > str q0, [x2, x3] > > add x3, x3, 16 > > cmp x3, 512 > > bne .L27 > > ret > > > > This runs as a pre step such that permute simplification can still > > inspect this permute is needed > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > Tests are included as part of the final patch as they need the SLP > > pattern matcher to insert permutes in between. > > > > Ok for master? > > So I think this is too specialized for the general issue that we're doing a > bad > job in CSEing the load part of different permutes of the same group. I've > played with fixing this half a year ago (again) in multiple general ways but > they all caused some regressions. > > So you're now adding some heuristics as to when to anticipate "CSE" (or > merging with followup permutes). > > To quickly recap what I did consider two loads (V2DF) one { a[0], a[1] } and > the other { a[1], a[0] }. They currently are two SLP nodes and one with a > load_permutation. > My original attempts focused on trying to get rid of load_permutation in > favor of lane_permute nodes and thus during SLP discovery I turned the > second into { a[0], a[1] } (magically unified with the other load) and a > followup lane-permute node. > > So for your case you have IIUC { a[0], a[0] } and { a[1], a[1] } which > eventually > will (due to patterns) be lane-permuted into { a[0], a[1] }, right? So > generalizing this as a single { a[0], a[1] } plus two lane-permute nodes { > 0, 0 } > and { 1, 1 } early would solve the issue as well?
Correct, I did wonder why it was generating two different nodes instead of a lane permute but didn't pay much attention that it was just a short coming. > Now, in general it might be > more profitable to generate the { a[0], a[0] } and { a[1], a[1] } via > scalar-load- > and-splat rather than vector load and permute so we have to be careful to > not over-optimize here or be prepared to do the reverse transform. This in principle can be done in optimize_slp then right? Since it would do a lot of the same work already and find the materialization points. > > The patch itself is a bit ugly since it modifies the SLP graph when we already > produced the graphds graph so I would do any of this before. I did consider > gathering all loads nodes loading from a group and then trying to apply some > heuristic to alter the SLP graph so it can be better optimized. In fact when > we > want to generate the same code as the non-SLP interleaving scheme does > we do have to look at those since we have to unify loads there. > Yes.. I will concede the patch isn't my finest work.. I also don't like the fact that I had to keep leafs in tact less I break things later. But wanted feedback :) > I'd put this after vect_slp_build_vertices but before the new_graph call - > altering 'vertices' / 'leafs' should be more easily possible and the 'leafs' > array > contains all loads already (vect_slp_build_vertices could be massaged to > provide a map from DR_GROUP_FIRST_ELEMENT to slp_tree, giving us the > meta we want). > > That said, I'd like to see something more forward-looking rather than the ad- > hoc special-casing of what you run into with the pattern matching. > Yeah, I like your suggestion about doing it at build time and CSEing early, but don't think I can get that work in a week given that you've already tried multiple times :) Happy to give it a go next stage-1 opening though. > In case we want to still go with the special-casing it should IMHO be done in > a > pre-order walk simply looking for lane permute nodes with children that all > load from the same group performing what you do before any of the > vertices/graph stuff is built. That's probably easiest at this point and it > can be > done when then bst_map is still around so you can properly CSE the new > load you build. That's fair enough. I do think I need a temporary (not terrible) workaround...This would then need to be somewhere in vect_analyze_slp. Would you prefer I do it during the construction of the instance of afterwards? Regards, Tamar > > Thanks, > Richard. > > > > > Thanks, > > Tamar > > > > gcc/ChangeLog: > > > > * tree-vect-slp.c (vect_optimize_slp): Promote permutes. > > > > > > -- > Richard Biener <rguent...@suse.de> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 > Nuernberg, Germany; GF: Felix Imend