On 04/12/15 16:04, Richard Biener wrote:
> On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence <alan.lawre...@arm.com> wrote:
>> On 27/11/15 08:30, Richard Biener wrote:
>>>
>>> This is part 1 of a fix for PR68533 which shows that some targets
>>> cannot can_vec_perm_p on an identity permutation.  I chose to fix
>>> this in the vectorizer by detecting the identity itself, but with
>>> the current structure of vect_transform_slp_perm_load this is
>>> somewhat awkward.  Thus the following no-op patch simplifies it
>>> greatly (from the times it was restricted to do interleaving-kind
>>> of permutes).  It turned out not to be 100% a no-op as we can now
>>> handle non-adjacent source operands, so I split it out from the
>>> actual fix.
>>>
>>> The two adjusted testcases no longer fail to vectorize because
>>> of "need three vectors", but unadjusted they would fail because
>>> there are simply not enough scalar iterations in the loop.  I
>>> adjusted that and now we vectorize them just fine (running into
>>> PR68559, which I filed).
>>>
>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
>>>
>>> Richard.
>>>
>>> 2015-11-27  Richard Biener  <rguent...@suse.de>
>>>
>>> 	PR tree-optimization/68553
>>> 	* tree-vect-slp.c (vect_get_mask_element): Remove.
>>> 	(vect_transform_slp_perm_load): Implement in a simpler way.
>>>
>>> 	* gcc.dg/vect/pr45752.c: Adjust.
>>> 	* gcc.dg/vect/slp-perm-4.c: Likewise.
>>
>> On aarch64 and ARM targets, this causes
>>
>> PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect "vectorizing stmts using SLP" 0
>>
>> That is, we now vectorize using SLP, when previously we did not.
>>
>> On aarch64 (and I expect ARM too), we previously used a VEC_LOAD_LANES,
>> without unrolling, but now we unroll * 4, and vectorize using 3 loads
>> and permutes:
>
> Happens on x86_64 as well with at least SSE4.1.  Unfortunately we'll have to
> start introducing much more fine-grained target-supports for vect_perm to
> reliably guard all targets.
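
The identity permutation mentioned above is one whose mask selects every
lane in place, so the permute is a no-op; the patch detects that in the
vectorizer itself rather than asking can_vec_perm_p whether the target
supports it.  A minimal sketch of such a permute using GCC's generic
vector extensions (the function name is made up for illustration):

typedef int v4si __attribute__ ((vector_size (16)));

/* __builtin_shuffle lowers to VEC_PERM_EXPR in GIMPLE.  With the
   identity mask { 0, 1, 2, 3 } every lane stays where it is, so the
   result is just X and no permute instruction is required.  */
v4si
identity_perm (v4si x)
{
  const v4si mask = { 0, 1, 2, 3 };
  return __builtin_shuffle (x, mask);
}
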
I don't know enough about SSE4.1 to know whether it's a problem there or
not.  This is an actual regression on AArch64 and ARM, not just a
testism: you now get

.L5:
	ldr	q0, [x5, 16]
	add	x4, x4, 48
	ldr	q1, [x5, 32]
	add	w6, w6, 1
	ldr	q4, [x5, 48]
	cmp	w3, w6
	ldr	q2, [x5], 64
	orr	v3.16b, v0.16b, v0.16b
	orr	v5.16b, v4.16b, v4.16b
	orr	v4.16b, v1.16b, v1.16b
	tbl	v0.16b, {v0.16b - v1.16b}, v6.16b
	tbl	v2.16b, {v2.16b - v3.16b}, v7.16b
	tbl	v4.16b, {v4.16b - v5.16b}, v16.16b
	str	q0, [x4, -32]
	str	q2, [x4, -48]
	str	q4, [x4, -16]
	bhi	.L5

instead of

.L5:
	ld4	{v4.4s - v7.4s}, [x7], 64
	add	w4, w4, 1
	cmp	w3, w4
	orr	v1.16b, v4.16b, v4.16b
	orr	v2.16b, v5.16b, v5.16b
	orr	v3.16b, v6.16b, v6.16b
	st3	{v1.4s - v3.4s}, [x6], 48
	bhi	.L5

LD4 and ST3 do all the permutes without needing separate permute
instructions, so a strategy that favours generic permutes over the
load-lanes forms is likely to be more expensive on most implementations.
I think this is worth a PR at least (a sketch of the loop shape involved
follows after the quoted dump).

regards
Ramana

>
> Richard.
>
>> ../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt:
>>
>> vect__31.15_94 = VEC_PERM_EXPR <vect__31.11_87, vect__31.12_89, { 0, 1, 2, 4 }>;
>> ../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt:
>>
>> vect__31.16_95 = VEC_PERM_EXPR <vect__31.12_89, vect__31.13_91, { 1, 2, 4, 5 }>;
>> ../gcc/gcc/testsuite/gcc.dg/vect/O3-pr36098.c:15:2: note: add new stmt:
>>
>> vect__31.17_96 = VEC_PERM_EXPR <vect__31.13_91, vect__31.14_93, { 2, 4, 5, 6 }>
>>
>> which *is* a valid vectorization strategy...
>>
>>
>> --Alan
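
For reference, a minimal sketch of the loop shape involved: loads form
groups of four ints (with a gap), stores form groups of three, matching
the ld4/st3 sequence and the VEC_PERM_EXPR masks above, which pick
elements 0, 1, 2, 4, 5, 6, ... out of the loaded vectors.  This is an
assumed reconstruction for illustration, not necessarily the actual
gcc.dg/vect/O3-pr36098.c testcase, and the function name is made up.

/* On AArch64 this loop can be vectorized either with ld4/st3
   load/store-lanes (no explicit permute instructions) or, as in the
   SLP code above, with contiguous ldr/str plus tbl permutes.  */
void
repack (int *restrict dst, const int *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    {
      dst[3 * i + 0] = src[4 * i + 0];
      dst[3 * i + 1] = src[4 * i + 1];
      dst[3 * i + 2] = src[4 * i + 2];
    }
}
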