https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119960

--- Comment #6 from Arseny Kapoulkine <arseny.kapoulkine at gmail dot com> ---
Thanks for the analysis! Just to make sure I didn't misunderstand: are you
saying you see gcc 15 vectorizing the stores when using znver4 tuning?
I tried this and it did not do that, using either g++-15.0 or latest g++ master
with -O3 -DNDEBUG -mtune=znver4. Note: this is *without* -march=znver4. With
-march=znver4 I see a further regression (with or without -mtune) to 4.0 GB/s.
The resulting assembly still has separate 32-bit stores and a combined 64-bit
load, but on top of that it looks like one of the pair elements gets extracted
into a GPR, which probably costs extra. Without arch-specific tuning, g++ 15
keeps the loaded pair in an XMM register and only uses vector ops on it.
