https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707
--- Comment #22 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 17 Dec 2015, alalaw01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707
>
> --- Comment #21 from alalaw01 at gcc dot gnu.org ---
> Here's the smallest testcase I could come up with (where SLP gets cancelled,
> but we end up with fewer st2's than before)...the key seems to be things being
> used in multiple places.
>
> #define N 1024
>
> int in1[N], in2[N];
> int out1[N], out2[N];
> int w[N];
>
> void foo() {
>   for (int i = 0; i < N; i+=2)
>     {
>       int a = in1[i] & in2[i];
>       int b = in1[i+1] & in2[i+1];
>       out1[i] = a;
>       out1[i+1] = b;
>       out2[i] = (a + w[i]) ^ (b+w[i+1]);
>       out2[i+1] = (b + w[i]) ^ (a+w[i+1]);
>     }
> }

When we discard the instance for out2 we end up using both SLP and
load-lanes for in1[], in2[] and w[], which means those loads are emitted
redundantly.  The code

.L2:
        ld2     {v16.4s - v17.4s}, [x2]
        add     x0, x0, 32
        ld2     {v18.4s - v19.4s}, [x1]
        ld2     {v2.4s - v3.4s}, [x3], 32
        ldr     q5, [x2, 16]
        and     v1.16b, v18.16b, v16.16b
        and     v0.16b, v19.16b, v17.16b
        cmp     x5, x3
        ldr     q21, [x1, 16]
        ldr     q4, [x2], 32
        ldr     q20, [x1], 32
        add     v17.4s, v0.4s, v3.4s
        add     v16.4s, v1.4s, v2.4s
        add     v0.4s, v0.4s, v2.4s
        add     v1.4s, v1.4s, v3.4s
        and     v5.16b, v5.16b, v21.16b
        and     v4.16b, v4.16b, v20.16b
        eor     v6.16b, v17.16b, v16.16b
        eor     v7.16b, v1.16b, v0.16b
        str     q5, [x0, -16]
        str     q4, [x0, -32]
        st2     {v6.4s - v7.4s}, [x4], 32
        bne     .L2
        ret

looks as expected - load/store-lanes for out2 and regular vector loads for
out1.  We need to unroll 4 times for load-lanes vs. only 2 times when doing
pure SLP (I wondered: if load-lanes requires more unrolling, should we
prefer SLP anyway?).  With SLP

.L2:
        ldr     q3, [x0, x5]
        ldr     q2, [x0, x4]
        ldr     q0, [x0, x2]
        rev64   v4.4s, v3.4s
        rev64   v1.4s, v2.4s
        and     v2.16b, v2.16b, v3.16b
        trn2    v3.4s, v0.4s, v0.4s
        trn1    v0.4s, v0.4s, v0.4s
        and     v1.16b, v1.16b, v4.16b
        str     q2, [x0, x3]
        add     v0.4s, v0.4s, v2.4s
        add     v1.4s, v1.4s, v3.4s
        eor     v0.16b, v1.16b, v0.16b
        str     q0, [x0, x1]
        add     x0, x0, 16
        cmp     x0, 4096
        bne     .L2

indeed looks much nicer because it can share the actual loads and ANDs (the
vectorizer itself doesn't do that sharing and doesn't know about it - only
later CSE does - so we can't use that "knowledge" for our decision).  But
note there is no st2 remaining (as expected for a full SLP vectorization).

Such cases are why I was thinking of applying this logic to all SLP
instances at once (but then, without knowing about actual sharing, we have
to weigh N non-permuted SLP instances against M (partially) permuted SLP
instances).  We can build up load-sharing information after we have built
all SLP instances and then base the decision on that (this would also make
it possible to improve SLP costing in general, which currently simply
assumes no later CSE will happen).  Hybrid SLP stmts simply mean that we
not only need different permutations of the loads but also have to
duplicate shared compute stmts (as elements are permuted differently) - in
this case the ANDs computing a and b.

Another possible improvement to the current heuristic would be to query
whether _all_ loads need to be permuted with SLP, like with

  if (STMT_VINFO_STRIDED_P (stmt_vinfo)
      || ! SLP_TREE_LOAD_PERMUTATION (load_node).exists ()
      || ! vect_load_lanes_supported (STMT_VINFO_VECTYPE (stmt_vinfo),
                                      GROUP_SIZE (stmt_vinfo)))
    break;

thus, if there is a load node which is not permuted, retain the SLP.  This
would preserve the SLP for out2 because there are unpermuted uses of a and
b.
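For reference, here is a rough sketch of where such a check would sit; the
surrounding loop and the variable names (new_instance, load_node) only
paraphrase the load-node walk of the proposed heuristic in tree-vect-slp.c
and are illustrative, not a verbatim quote:

  slp_tree load_node;
  unsigned i;
  /* Walk all load nodes of the SLP instance; break out as soon as one of
     them should stay SLP.  */
  FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (new_instance), i, load_node)
    {
      stmt_vec_info stmt_vinfo
        = vinfo_for_stmt (SLP_TREE_SCALAR_STMTS (load_node)[0]);
      if (STMT_VINFO_STRIDED_P (stmt_vinfo)
          /* Added check: an unpermuted load node can use a plain vector
             load, so prefer to keep the SLP instance.  */
          || ! SLP_TREE_LOAD_PERMUTATION (load_node).exists ()
          || ! vect_load_lanes_supported (STMT_VINFO_VECTYPE (stmt_vinfo),
                                          GROUP_SIZE (stmt_vinfo)))
        break;
    }
  /* Cancel the instance in favor of load/store-lanes only if no load node
     made us break out, i.e. all loads are permuted and load-lanes
     capable.  */
  if (i == SLP_INSTANCE_LOADS (new_instance).length ())
    {
      /* ... discard the SLP instance and fall back to interleaving as the
         current heuristic does ...  */
    }

With that, a single unpermuted load node in an instance is enough to retain
the SLP for it.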
It would also preserve the SLP if you comment out the out1 stores, thus
generating

.L2:
        ldr     q3, [x0, x4]
        ldr     q2, [x0, x3]
        ldr     q0, [x0, x2]
        rev64   v5.4s, v3.4s
        rev64   v1.4s, v2.4s
        and     v2.16b, v2.16b, v3.16b
        trn2    v4.4s, v0.4s, v0.4s
        trn1    v0.4s, v0.4s, v0.4s
        and     v1.16b, v1.16b, v5.16b
        add     v0.4s, v0.4s, v2.4s
        add     v1.4s, v1.4s, v4.4s
        eor     v0.16b, v1.16b, v0.16b
        str     q0, [x0, x1]
        add     x0, x0, 16
        cmp     x0, 4096
        bne     .L2

instead of

.L2:
        ld2     {v6.4s - v7.4s}, [x2], 32
        ld2     {v16.4s - v17.4s}, [x1], 32
        ld2     {v2.4s - v3.4s}, [x3], 32
        and     v1.16b, v16.16b, v6.16b
        and     v0.16b, v17.16b, v7.16b
        add     v6.4s, v1.4s, v2.4s
        add     v7.4s, v0.4s, v3.4s
        add     v1.4s, v1.4s, v3.4s
        add     v0.4s, v0.4s, v2.4s
        eor     v4.16b, v7.16b, v6.16b
        eor     v5.16b, v1.16b, v0.16b
        st2     {v4.4s - v5.4s}, [x0], 32
        cmp     x4, x0
        bne     .L2

which doesn't look like the best decision (the load/store-lane loop does
twice the work in about the same amount of code).  Again, if we knew that
we use in1/in2 with two different permutations we could make a more
informed decision.

So I'm not sure where to go from here.  I fear that to get a better
heuristic than the one proposed we need to push this, for example, to
vect_make_slp_decision, where all instances have already been built, and
gather some sharing data there.  But then it is only a small step to the
point where we could actually compare SLP vs. non-SLP costs.
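To make the last point slightly more concrete, gathering the sharing data
in (or right before) vect_make_slp_decision could look roughly like the
sketch below.  The struct, the helper name gather_load_sharing and the
exact keying are invented for illustration and it does no costing yet; only
the macros and accessors are the existing vectorizer ones, and the code is
assumed to live in tree-vect-slp.c where the needed headers are already
included.

/* Record, per interleaving group (keyed on the group's first scalar load
   stmt), how many SLP load nodes reference it permuted vs. unpermuted.  */

struct load_use { unsigned permuted; unsigned unpermuted; };

static void
gather_load_sharing (loop_vec_info loop_vinfo,
                     hash_map<gimple *, load_use> &uses)
{
  slp_instance instance;
  slp_tree load_node;
  unsigned i, j;
  FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (loop_vinfo), i, instance)
    FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (instance), j, load_node)
      {
        /* Key on the first element of the interleaving group so that
           different SLP nodes loading from the same group share one entry
           (e.g. the in1[]/in2[] loads used by both out1 and out2 above).  */
        gimple *first
          = GROUP_FIRST_ELEMENT
              (vinfo_for_stmt (SLP_TREE_SCALAR_STMTS (load_node)[0]));
        bool existed;
        load_use &u = uses.get_or_insert (first, &existed);
        if (!existed)
          u.permuted = u.unpermuted = 0;
        if (SLP_TREE_LOAD_PERMUTATION (load_node).exists ())
          u.permuted++;
        else
          u.unpermuted++;
      }
}

A group that ends up with both permuted and unpermuted uses, or that is
referenced by more than one instance at all, is exactly the sharing the
current heuristic cannot see, and the same table would be a natural
starting point for an actual SLP vs. non-SLP cost comparison.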