https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103554

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com,
                   |                            |rguenth at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-12-06
             Blocks|                            |53947
     Ever confirmed|0                           |1
             Target|                            |x86_64-*-*

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is

t.ii:8:11: note:   ==> examining statement: _1 = in.d;
t.ii:8:11: missed:   BB vectorization with gaps at the end of a load is not supported
t.ii:8:16: missed:   not vectorized: relevant stmt not supported: _1 = in.d;
t.ii:8:11: note:   Building vector operands of 0x447c768 from scalars instead

when trying to vectorize this with V4DI.  We don't realize that, because the
decl is visible, we can load the gap as well.  I think I've seen this before.

Note with SSE the same issue is present but we create the V2DI vectors in
a more optimal way from scalars.  With the above issue fixed we'd instead
use two V4DI unaligned vector moves from the stack and a shuffle.

The locally optimal solution would be two unaligned V2DI loads and either
two V2DI stores or a V4DI merge and store.
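
To make the "two V2DI loads plus a V4DI merge and store" shape concrete, here
is a minimal sketch using GCC/Clang vector extensions.  The helper name
merge_pairs and its pointer arguments are hypothetical and not taken from the
testcase; the typedef names only mirror the mode names used above.  Which
instructions the compiler actually emits for this depends on the target flags.

```cpp
#include <cstring>

// GCC/Clang vector extensions; type names chosen to mirror the mode names.
typedef long long v2di __attribute__((vector_size(16)));
typedef long long v4di __attribute__((vector_size(32)));

// Hypothetical helper: gather two separate pairs of 64-bit values into
// four contiguous lanes.
void merge_pairs(const long long *lo, const long long *hi, long long *out)
{
    v2di a, b;
    std::memcpy(&a, lo, sizeof a);        // unaligned V2DI load
    std::memcpy(&b, hi, sizeof b);        // unaligned V2DI load
    v4di r = { a[0], a[1], b[0], b[1] };  // merge into one V4DI value
    std::memcpy(out, &r, sizeof r);       // unaligned V4DI store
}
```

Calling merge_pairs with lo = {1, 2} and hi = {3, 4} fills out with the four
values in order.  The alternative with two V2DI stores would simply copy a and
b to out and out + 2 without forming r.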

_Note_ that the suboptimal solution presented here is likely faster because
it avoids STLF (store-to-load forwarding) penalties from the call's stack
setup, which very likely uses scalar or differently aligned vector moves.

Note the x86 backend costs the SSE variant

t.ii:8:11: note: Cost model analysis for part in loop 0:
  Vector cost: 40
  Scalar cost: 48

and the AVX variant

t.ii:8:11: note: Cost model analysis for part in loop 0:
  Vector cost: 48
  Scalar cost: 48

but the x86 backend chooses not to let the vectorizer compare costs across
different vector sizes; instead it asks it to pick the first working
solution from the vector of modes to consider (in that order).  We
might want to reconsider that (maybe at least for BB vectorization and
maybe with some extra special mode?).
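
The difference between the two selection policies can be sketched with the
cost numbers quoted above.  This is only a toy model: the Attempt struct and
both functions are hypothetical and are not GCC's data structures.  It assumes
that V4DI is tried before V2DI when AVX is enabled, and that an attempt counts
as working when its vector cost does not exceed the scalar cost.

```cpp
#include <string>
#include <vector>

// Hypothetical model of one vectorization attempt; not GCC's internals.
struct Attempt {
    std::string mode;  // vector mode tried, e.g. "V4DI"
    int vector_cost;   // cost reported for the vectorized form
    int scalar_cost;   // cost reported for the scalar form
};

// Assumption: ties between vector and scalar cost go to the vector side.
static bool works(const Attempt &a)
{
    return a.vector_cost <= a.scalar_cost;
}

// Policy described above: take the first mode that works, in list order.
std::string pick_first_working(const std::vector<Attempt> &attempts)
{
    for (const Attempt &a : attempts)
        if (works(a))
            return a.mode;
    return "scalar";
}

// Alternative policy: among the working modes, take the cheapest one.
std::string pick_cheapest(const std::vector<Attempt> &attempts)
{
    const Attempt *best = nullptr;
    for (const Attempt &a : attempts)
        if (works(a) && (!best || a.vector_cost < best->vector_cost))
            best = &a;
    return best ? best->mode : "scalar";
}
```

With the attempts {"V4DI", 48, 48} and {"V2DI", 40, 48} in that order, the
first policy settles on V4DI while the second one picks V2DI.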


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
