https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119960
--- Comment #6 from Arseny Kapoulkine <arseny.kapoulkine at gmail dot com> ---
Thanks for the analysis!

Just to make sure I'm not misunderstanding: are you saying that you see gcc 15 vectorizing the stores when using znver4 tuning? I tried this and it did not; I used both g++-15.0 and the latest g++ master with -O3 -DNDEBUG -mtune=znver4.

Note that this is *without* -march=znver4. With -march=znver4 I see a further regression (with or without -mtune) to 4.0 GB/s. The resulting assembly still has separate 32-bit stores and a combined 64-bit load, but in addition one of the pair elements appears to be extracted into a GPR, which probably costs extra. Without arch-specific tuning, g++15 keeps the loaded pair in an XMM register and only uses vector ops on it.