https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138
--- Comment #3 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> So the expected vectorization builds vectors
>
>  { tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0] }
>
> that's not SLP, SLP tries to build the
>
>  { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }
>
> vector and "succeeds" - the SLP tree turns out to be
> highly inefficient though.  So for the stores your desire
> is to see an interleaving scheme with VF 4 (the number of
> iterations).  But interleaving fails because it would require
> a VF of 16 and there are not enough iteration in the loop.
>
> The classical SLP scheme degenerates (also due to the plus/minus
> mixed ops) to uniform vectors as we venture beyond the a{0,2} {+,-} a{1,3}
> expression.
>
> Starting SLP discovery from the grouped loads would get things going
> up to the above same expression.
>
> So not sure what's the best approach to this case.  The testcase
> can be simplified still showing the SLP discovery issue:
>
> extern void test(unsigned int t[4][4]);
>
> void foo(int *p1, int i1, int *p2, int i2)
> {
>   unsigned int tmp[4][4];
>   unsigned int a0, a1, a2, a3;
>
>   for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
>     a0 = (p1[0] - p2[0]);
>     a1 = (p1[1] - p2[1]);
>     a2 = (p1[2] - p2[2]);
>     a3 = (p1[3] - p2[3]);
>
>     int t0 = a0 + a1;
>     int t1 = a0 - a1;
>     int t2 = a2 + a3;
>     int t3 = a2 - a3;
>
>     tmp[i][0] = t0 + t2;
>     tmp[i][2] = t0 - t2;
>     tmp[i][1] = t1 + t3;
>     tmp[i][3] = t1 - t3;
>   }
>   test(tmp);
> }
>
> So it's basically SLP discovery degenerating to an interleaving scheme
> on the load side but not actually "implementing" it.

IIUC, in the current implementation we get four grouped stores:

  { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }  /i=0,1,2,3/

each handled independently.

When all of these attempts fail, could we retry with the groups

  { tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] }  /i=0,1,2,3/

adding one extra intermediate layer that covers the original groups, and
then start from these newly adjusted groups?  The built operands should be
isomorphic then.  Maybe too hackish?
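
To make the regrouping idea concrete, here is a minimal sketch in plain C.
It is not GCC internals and not a proposed patch; it only shows that when the
groups are formed across iterations ({ tmp[0][j], ..., tmp[3][j] }), every
operation in the expression tree applies the same opcode to all four lanes.
The names vec4, v_add, v_sub, load_diff_column, columns_regrouped and
groupings_agree are hypothetical and exist only for this illustration.  One
assumption differs from the testcase: the differences are computed in
unsigned arithmetic so that the sketch stays well-defined for any input.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a 4-lane vector; lane i holds iteration i. */
typedef struct { unsigned int lane[4]; } vec4;

static vec4 v_add(vec4 x, vec4 y)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = x.lane[i] + y.lane[i];
    return r;
}

static vec4 v_sub(vec4 x, vec4 y)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = x.lane[i] - y.lane[i];
    return r;
}

/* Row-wise reference: the loop body of the simplified testcase
   (unsigned arithmetic throughout, see the assumption above). */
static void rows_reference(const int *p1, int i1, const int *p2, int i2,
                           unsigned int tmp[4][4])
{
    for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
        unsigned int a0 = (unsigned int)p1[0] - (unsigned int)p2[0];
        unsigned int a1 = (unsigned int)p1[1] - (unsigned int)p2[1];
        unsigned int a2 = (unsigned int)p1[2] - (unsigned int)p2[2];
        unsigned int a3 = (unsigned int)p1[3] - (unsigned int)p2[3];
        unsigned int t0 = a0 + a1, t1 = a0 - a1;
        unsigned int t2 = a2 + a3, t3 = a2 - a3;
        tmp[i][0] = t0 + t2;
        tmp[i][2] = t0 - t2;
        tmp[i][1] = t1 + t3;
        tmp[i][3] = t1 - t3;
    }
}

/* Column j of the differences: { a_j for i=0, ..., a_j for i=3 }. */
static vec4 load_diff_column(const int *p1, int i1,
                             const int *p2, int i2, int j)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = (unsigned int)p1[i * i1 + j] - (unsigned int)p2[i * i2 + j];
    return r;
}

/* Regrouped form: each statement uses one opcode for all four lanes. */
static void columns_regrouped(const int *p1, int i1, const int *p2, int i2,
                              unsigned int tmp[4][4])
{
    vec4 a0 = load_diff_column(p1, i1, p2, i2, 0);
    vec4 a1 = load_diff_column(p1, i1, p2, i2, 1);
    vec4 a2 = load_diff_column(p1, i1, p2, i2, 2);
    vec4 a3 = load_diff_column(p1, i1, p2, i2, 3);
    vec4 t0 = v_add(a0, a1), t1 = v_sub(a0, a1);
    vec4 t2 = v_add(a2, a3), t3 = v_sub(a2, a3);
    vec4 c0 = v_add(t0, t2), c2 = v_sub(t0, t2);
    vec4 c1 = v_add(t1, t3), c3 = v_sub(t1, t3);
    for (int i = 0; i < 4; i++) {   /* strided store of each column */
        tmp[i][0] = c0.lane[i];
        tmp[i][1] = c1.lane[i];
        tmp[i][2] = c2.lane[i];
        tmp[i][3] = c3.lane[i];
    }
}

/* Returns 1 when both groupings produce the same tmp[][] contents. */
int groupings_agree(void)
{
    int src1[32], src2[32];
    unsigned int by_rows[4][4], by_cols[4][4];
    for (int k = 0; k < 32; k++) {
        src1[k] = k * 7 + 3;
        src2[k] = 100 - k * 5;
    }
    rows_reference(src1, 8, src2, 8, by_rows);
    columns_regrouped(src1, 8, src2, 8, by_cols);
    return memcmp(by_rows, by_cols, sizeof by_rows) == 0;
}

/* Convenience wrapper for a test driver. */
void check_groupings(void)
{
    assert(groupings_agree());
}
```

The row-wise form mixes plus and minus within one store group, while the
regrouped form has a single opcode per statement.  The cost moves to the
strided loads and stores, which is where the extra intermediate layer would
presumably have to sit.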