[Bug target/118310] Poorly optimized trivial integer serialization due to vectorizer on x86_64

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 07 Jan 2025 03:58:24 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118310


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
We're vectorizing this as

  _75 = {_1, _3, _5, _7, _9, _11, _13, _15,
         _16, _18, _20, _22, _24, _26, _28, _30};
  vectp.4_76 = dst_33(D);
  MEM <vector(16) unsigned char> [(unsigned char *)vectp.4_76] = _75;

and cost-wise this is caused by the high cost of scalar stores relative to
the vector store, together with a vector construction cost that is not
pessimized enough:

_1 1 times scalar_store costs 12 in body
... (16 times)
_1 1 times unaligned_store (misalign -1) costs 12 in body
node 0x2b8f9f00 1 times vec_construct costs 156 in prologue
t.c:5:12: note: Cost model analysis for part in loop 0:
  Vector cost: 180
  Scalar cost: 192
t.c:5:12: note: Basic block will be vectorized using SLP

Without vectorization, store-merging detects this as a no-op move.
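
For illustration only, here is a minimal sketch of the kind of source that
produces 16 byte stores from two 64-bit values, next to the two 8-byte
copies that a no-op-move detection effectively reduces it to.  The function
names and the loop form are my assumptions, not the testcase from the
report (which may spell out the 16 stores), and the merged form matches
only on a little-endian target such as x86_64.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reproducer shape: little-endian serialization of two
   64-bit halves, one byte store per output byte (16 stores in total).  */
void serialize_bytewise(unsigned char *dst, uint64_t low, uint64_t high)
{
    for (int i = 0; i < 8; i++) {
        /* For i == 0 the shift count is 0, i.e. the "low >> 0" case.  */
        dst[i]     = (unsigned char)(low  >> (8 * i));
        dst[8 + i] = (unsigned char)(high >> (8 * i));
    }
}

/* What the byte stores amount to on a little-endian target: two plain
   8-byte copies, with no shifting or vector construction needed.  */
void serialize_merged(unsigned char *dst, uint64_t low, uint64_t high)
{
    memcpy(dst, &low, sizeof low);
    memcpy(dst + 8, &high, sizeof high);
}

/* Runtime byte-order probe so the comparison can be skipped elsewhere.  */
int is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

On a little-endian host both functions fill dst with the same 16 bytes.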

Vectorization is confused by some patterns and by the low >> 0 shift, which
is elided.  For the vectorizer, dst[] = BIT_FIELD_REF <low, ...>
and dst[] = BIT_FIELD_REF <high, ...> would have been the better
representation (though it still wouldn't be directly supported).

I suppose store-merging and BB vectorization should run at the same time
and be costed against each other.

Alternatively, pattern recognition could recognize a BIT_FIELD_REF as well
(as could SLP discovery, if it were done differently).

So I suppose when BB vectorizing a store group we could use
store-merging's process_store () and terminate_and_process_all_chains ()
and always prefer store-merging (but while vectorization considers the
whole function, store-merging works on one store group at a time).

Store-merging is also set up to fix up some cases of "bad" vectorization
via maybe_optimize_vector_constructor, but that seems to only consider
bswaps, not 1:1 copies from two sources as seen here.  It does not run
on the BB with the single store and the CTOR because the vector size is
128 bits and only 16, 32 or 64 bits are handled, so we likely do not
consider splitting the store.  Doing that in store-merging might be
difficult.

One could argue that we should do some basic store-merging earlier (at bswap
time) as well, at least for stores from a bswap or for a nop-move like the
one we have here.
