https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117031
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #3)
> (In reply to Richard Biener from comment #2)
> > (In reply to Tamar Christina from comment #0)
> > > GCC seems to miss that there is no gap between the group accesses and
> > > that stride == 1.
> > > test3 is vectorized linearly by GCC, so it seems this is a missed
> > > optimization in data ref analysis?
> > 
> > The load-lanes look fine, so it must be the code generation for the
> > HI to DI via SI conversions using unpacks you are complaining about?
> 
> No, that one I have a patch for.
> 
> > Using load-lanes is natural here.
> > 
> > This isn't about permutes due to VF or so, isn't it?
> 
> It is; the load-lanes are unnecessary because there is no permute during
> the loop: the group size is equal to the stride and the offsets are linear.
> 
> LOAD_LANES are really expensive, especially 4-register ones.
> 
> My complaint is that this loop does not have a permute.  While it may look
> like the entries are permuted, they are not.
> 
> Essentially test1 and test3 are the same.  The vectorizer picks VF=8, so it
> unrolls test1 into test3, but fails to see that the unrolled code is linear;
> when manually unrolled it does see it:
> 
> e.g.
> 
> void
> test3 (unsigned short *x, double *y, int n)
> {
>   for (int i = 0; i < n; i+=2)
>     {
>       unsigned short a1 = x[i * 4 + 0];
>       unsigned short b1 = x[i * 4 + 1];
>       unsigned short c1 = x[i * 4 + 2];
>       unsigned short d1 = x[i * 4 + 3];
>       y[i+0] = (double)a1 + (double)b1 + (double)c1 + (double)d1;
>       unsigned short a2 = x[(i + 1) * 4 + 0];
>       unsigned short b2 = x[(i + 1) * 4 + 1];
>       unsigned short c2 = x[(i + 1) * 4 + 2];
>       unsigned short d2 = x[(i + 1) * 4 + 3];
>       y[i+1] = (double)a2 + (double)b2 + (double)c2 + (double)d2;
>     }
> }
> 
> does not use LOAD_LANES.

It uses interleaving because there's no ld8, and when
vect_lower_load_permutations decides heuristically to use load-lanes it
tries to do so in a vector-size-agnostic way, so it doesn't consider using
ld4 twice.

There _are_ permutes, because four lanes are used to compute the single-lane
store of the reduction operation.  The vectorization of the unrolled loop,
which does not use load-lanes, shows them:

  vect_a1_53.10_234 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_232];
  vectp_x.8_235 = vectp_x.8_232 + 16;
  vect_a1_53.11_236 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_235];
  vectp_x.8_237 = vectp_x.8_232 + 32;
  vect_a1_53.12_238 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_237];
  vectp_x.8_239 = vectp_x.8_232 + 48;
  vect_a1_53.13_240 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_239];
  _254 = VEC_PERM_EXPR <vect_a1_53.10_234, vect_a1_53.11_236, { 1, 3, 5, 7, 9, 11, 13, 15 }>;
  _255 = VEC_PERM_EXPR <vect_a1_53.12_238, vect_a1_53.13_240, { 1, 3, 5, 7, 9, 11, 13, 15 }>;
  _286 = VEC_PERM_EXPR <_254, _255, { 1, 3, 5, 7, 9, 11, 13, 15 }>;
  ...

That's simply load-lanes open-coded.  If open-coding ld4 is better than
using ld4, just make it not available to the vectorizer?  Similar to ld2,
I suppose.
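
For reference, test1 itself is only mentioned above, not quoted; a minimal
sketch, assuming it is simply the non-unrolled form of test3 (one group of
four loads and one store per iteration) -- this is an editorial assumption,
not the exact testcase from comment #0:

void
test1 (unsigned short *x, double *y, int n)
{
  for (int i = 0; i < n; i++)
    {
      /* One group of four consecutive loads; group size == stride == 4,
         so there is no gap and no real permute within the group.  */
      unsigned short a = x[i * 4 + 0];
      unsigned short b = x[i * 4 + 1];
      unsigned short c = x[i * 4 + 2];
      unsigned short d = x[i * 4 + 3];
      y[i] = (double)a + (double)b + (double)c + (double)d;
    }
}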
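
To make "load-lanes open-coded" concrete, a sketch using AArch64 ACLE
intrinsics of two ways to de-interleave eight 4-element groups of shorts;
the helper names and the exact uzp1/uzp2 arrangement are illustrative
assumptions, not vectorizer output:

#include <arm_neon.h>
#include <stdint.h>

/* ld4: one structure load, de-interleaving done in hardware.
   .val[0] holds the a fields, .val[1] b, .val[2] c, .val[3] d.  */
static inline uint16x8x4_t
load_groups_ld4 (const uint16_t *x)
{
  return vld4q_u16 (x);
}

/* Open-coded: four contiguous loads plus an explicit de-interleave
   network, roughly what the VEC_PERM_EXPR chain above expands to.  */
static inline uint16x8x4_t
load_groups_open_coded (const uint16_t *x)
{
  uint16x8_t v0 = vld1q_u16 (x);        /* x[0..7]   */
  uint16x8_t v1 = vld1q_u16 (x + 8);    /* x[8..15]  */
  uint16x8_t v2 = vld1q_u16 (x + 16);   /* x[16..23] */
  uint16x8_t v3 = vld1q_u16 (x + 24);   /* x[24..31] */
  uint16x8_t e01 = vuzp1q_u16 (v0, v1); /* even lanes of x[0..15]  */
  uint16x8_t o01 = vuzp2q_u16 (v0, v1); /* odd lanes of x[0..15]   */
  uint16x8_t e23 = vuzp1q_u16 (v2, v3); /* even lanes of x[16..31] */
  uint16x8_t o23 = vuzp2q_u16 (v2, v3); /* odd lanes of x[16..31]  */
  uint16x8x4_t r;
  r.val[0] = vuzp1q_u16 (e01, e23);     /* x[0], x[4], ...  (a)    */
  r.val[1] = vuzp1q_u16 (o01, o23);     /* x[1], x[5], ...  (b)    */
  r.val[2] = vuzp2q_u16 (e01, e23);     /* x[2], x[6], ...  (c)    */
  r.val[3] = vuzp2q_u16 (o01, o23);     /* x[3], x[7], ...  (d)    */
  return r;
}

Whether the single ld4 or the four ld1 plus six uzp instructions is cheaper
is exactly the cost question raised above.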