[Bug tree-optimization/119181] Missed vectorization due to imperfect SLP discovery for 2 grouped load with same base pointer (taken as 1 interleaved load)

rguenth at gcc dot gnu.org via Gcc-bugs Wed, 12 Mar 2025 01:10:54 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119181


--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #10)
> But it still can't fix the issue with
> 
> void
> foo (int* a, int* restrict b)
> {
>     b[0] = a[0] * a[8];
>     b[1] = a[1] * a[9];
>     b[2] = a[2] * a[10];
>     b[3] = a[11] * a[3];
>     b[4] = a[12] * a[4];
>     b[5] = a[5] * a[13];
>     b[6] = a[6] * a[14];
>     b[7] = a[7] * a[15];
> }
> 
> -O2 -mavx2
> 
> foo:
>         vmovdqu ymm0, YMMWORD PTR [rdi]
>         vmovdqu ymm2, YMMWORD PTR [rdi+32]
>         vpblendd        ymm1, ymm2, ymm0, 231
>         vpblendd        ymm0, ymm0, ymm2, 231
>         vpmulld ymm0, ymm1, ymm0
>         vmovdqu YMMWORD PTR [rsi], ymm0
>         vzeroupper
>         ret
> 
> There's 2 redundant vpblendd here.

Yes, which is why I didn't try splitting groups - most practical cases
will not have a large constant gap.  Instead this asks for an optimization
phase on the SLP tree, possibly as part of permute optimization.
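
To make the redundancy concrete, here is a small scalar model of the code
quoted above.  It is only a sketch: blend8, mul_with_blends, mul_plain and
blends_are_redundant are names made up for this illustration, and blend8
models the per-lane select of vpblendd (bit i of the immediate picks the
second source), it is not how the vectorizer represents anything.  Since
the two blends use the same mask with swapped operands and the multiply
is commutative, each lane still multiplies a[i] by a[i+8].

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 8 };

/* Scalar model of vpblendd: lane i is taken from `set` when bit i of
   imm is 1, otherwise from `clear`.  Hypothetical helper.  */
void blend8(int32_t *dst, const int32_t *clear, const int32_t *set,
            unsigned imm)
{
    assert(imm < 256u);
    for (int i = 0; i < LANES; i++)
        dst[i] = ((imm >> i) & 1u) ? set[i] : clear[i];
}

/* Shape of the emitted code: two complementary blends, then a multiply.  */
void mul_with_blends(int32_t *b, const int32_t *a)
{
    int32_t lhs[LANES], rhs[LANES];
    blend8(lhs, a + LANES, a, 231u);  /* vpblendd ymm1, ymm2, ymm0, 231 */
    blend8(rhs, a, a + LANES, 231u);  /* vpblendd ymm0, ymm0, ymm2, 231 */
    for (int i = 0; i < LANES; i++)
        b[i] = lhs[i] * rhs[i];
}

/* Same result without any blend: multiply the two loads directly.  */
void mul_plain(int32_t *b, const int32_t *a)
{
    for (int i = 0; i < LANES; i++)
        b[i] = a[i] * a[i + LANES];
}

/* Returns 1 when both forms agree on a small sample input.  */
int blends_are_redundant(void)
{
    int32_t a[2 * LANES], b1[LANES], b2[LANES];
    for (int i = 0; i < 2 * LANES; i++)
        a[i] = 3 * i + 1;
    mul_with_blends(b1, a);
    mul_plain(b2, a);
    for (int i = 0; i < LANES; i++)
        if (b1[i] != b2[i])
            return 0;
    return 1;
}
```

Only lanes 3 and 4 have their operands swapped by the blends (231 is
0b11100111), which matches b[3] and b[4] in the testcase.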

For vector code as in comment #11 this could be optimized by either a
match.pd pattern or by forwprop.  Note that it could be deeper in an
expression tree, like permute * (x + permute), where eliding two
permutes in exchange for an additional permute on 'x' might pay off.
That shouldn't be done with match.pd or simple pattern matching but
would ask for some kind of propagation pass (like we do in SLP permute
optimization).
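
As a rough illustration of the deeper case, the sketch below checks the
lane-wise identity behind such a propagation on plain arrays:
permute(a) * (x + permute(b)) equals permute(a * (invpermute(x) + b)).
All names (permute8, invert8, eval_two_permutes, eval_propagated) are
hypothetical, and it assumes both permutes use the same selector.  Note
the propagated form still has a permute on the result; whether the trade
is a win depends on that permute being absorbed elsewhere, which this
sketch does not model.

```c
#include <assert.h>
#include <stdint.h>

enum { N = 8 };

/* dst[i] = src[sel[i]]; sel is expected to be a permutation of 0..N-1.  */
void permute8(int32_t *dst, const int32_t *src, const int *sel)
{
    for (int i = 0; i < N; i++) {
        assert(sel[i] >= 0 && sel[i] < N);
        dst[i] = src[sel[i]];
    }
}

/* Compute the inverse selector: inv[sel[i]] = i.  */
void invert8(int *inv, const int *sel)
{
    for (int i = 0; i < N; i++)
        inv[sel[i]] = i;
}

/* Original shape: permute(a) * (x + permute(b)), two permutes on loads.  */
void eval_two_permutes(int32_t *r, const int32_t *a, const int32_t *b,
                       const int32_t *x, const int *sel)
{
    int32_t pa[N], pb[N];
    permute8(pa, a, sel);
    permute8(pb, b, sel);
    for (int i = 0; i < N; i++)
        r[i] = pa[i] * (x[i] + pb[i]);
}

/* Propagated shape: a and b are used unpermuted, x gets the inverse
   permute, and the original permute moves to the result.  */
void eval_propagated(int32_t *r, const int32_t *a, const int32_t *b,
                     const int32_t *x, const int *sel)
{
    int inv[N];
    int32_t px[N], t[N];
    invert8(inv, sel);
    permute8(px, x, inv);
    for (int i = 0; i < N; i++)
        t[i] = a[i] * (px[i] + b[i]);
    permute8(r, t, sel);
}
```

A pattern matcher that only looks at one statement cannot see this,
because the permute on 'x' only makes sense once both other permutes
have been found.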
