https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

--- Comment #3 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> So the expected vectorization builds vectors
> 
>  { tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0] }
> 
> that's not SLP, SLP tries to build the
> 
>  { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] }
> 
> vector and "succeeds" - the SLP tree turns out to be
> highly inefficient though.  So for the stores your desire
> is to see an interleaving scheme with VF 4 (the number of
> iterations).  But interleaving fails because it would require
> a VF of 16 and there are not enough iteration in the loop.
> 
> The classical SLP scheme degenerates (also due to the plus/minus
> mixed ops) to uniform vectors as we venture beyond the a{0,2} {+,-} a{1,3}
> expression.
> 
> Starting SLP discovery from the grouped loads would get things going
> up to the above same expression.
> 
> So not sure what's the best approach to this case.  The testcase
> can be simplified still showing the SLP discovery issue:
> 
> extern void test(unsigned int t[4][4]);
> 
> void foo(int *p1, int i1, int *p2, int i2)
> {
>   unsigned int tmp[4][4];
>   unsigned int a0, a1, a2, a3;
> 
>   for (int i = 0; i < 4; i++, p1 += i1, p2 += i2) {
>       a0 = (p1[0] - p2[0]);
>       a1 = (p1[1] - p2[1]);
>       a2 = (p1[2] - p2[2]);
>       a3 = (p1[3] - p2[3]);
> 
>       int t0 = a0 + a1;
>       int t1 = a0 - a1;
>       int t2 = a2 + a3;
>       int t3 = a2 - a3;
> 
>       tmp[i][0] = t0 + t2;
>       tmp[i][2] = t0 - t2;
>       tmp[i][1] = t1 + t3;
>       tmp[i][3] = t1 - t3;
>   }
>   test(tmp);
> }
> 
> So it's basically SLP discovery degenerating to an interleaving scheme
> on the load side but not actually "implementing" it.

IIUC, in current implementation, we get four grouped stores:
  { tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3] } /i=0,1,2,3/ independently 

When all these tryings fail, could we do some re-try on the groups
  { tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] } /i=0,1,2,3/
with one extra intermediate layer covering those original groups, then start
from these newly adjusted groups? the built operands should isomorphic then.
May be too hackish?

Reply via email to