https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104665
Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2022-02-23
           Severity|normal                      |enhancement
     Ever confirmed|0                           |1

--- Comment #3 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
A couple of reasons. First, store merging happens too late. Second, store
merging does not work in the loop case.
Take:
#if 1
enum b : unsigned char{};
#else
typedef unsigned char b;
#endif

void serialize_le(b* __restrict dst, const unsigned* __restrict src) {
//    for (int i = 0; i < 128; ++i, ++src)
    {
        unsigned t = *src;
        *dst++ = static_cast<b>((t >>  0) & 0xff);
        *dst++ = static_cast<b>((t >>  8) & 0xff);
        *dst++ = static_cast<b>((t >> 16) & 0xff);
        *dst++ = static_cast<b>((t >> 24) & 0xff);
    }
}

This gets optimized to one load followed by one store. But once you add the
loop and use -fno-tree-vectorize (because GCC's vectorizer kicks in, which
causes other issues), the stores are not merged into one.
Also, store merging happens way after loop distribution happens so ...
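
For comparison, here is a minimal sketch of the loop case next to a hand-merged version with one 32-bit store per iteration. It is an illustration only, not GCC output and not part of the original testcase. The names serialize_le_bytewise, serialize_le_merged, host_is_little_endian and check_equivalence are hypothetical, std::uint32_t replaces the testcase's unsigned to pin the width, and the merged version assumes a little-endian host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Loop case from the comment: four byte stores per 32-bit word.
void serialize_le_bytewise(unsigned char* __restrict dst,
                           const std::uint32_t* __restrict src,
                           std::size_t n) {
    for (std::size_t i = 0; i < n; ++i, ++src) {
        std::uint32_t t = *src;
        *dst++ = static_cast<unsigned char>((t >>  0) & 0xff);
        *dst++ = static_cast<unsigned char>((t >>  8) & 0xff);
        *dst++ = static_cast<unsigned char>((t >> 16) & 0xff);
        *dst++ = static_cast<unsigned char>((t >> 24) & 0xff);
    }
}

// Hand-written stand-in for the merged form: one 4-byte copy per word.
// ASSUMPTION: little-endian host, so the in-memory byte order of a
// std::uint32_t already matches the little-endian output format.
void serialize_le_merged(unsigned char* __restrict dst,
                         const std::uint32_t* __restrict src,
                         std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        std::memcpy(dst + 4 * i, &src[i], sizeof(std::uint32_t));
    }
}

// Runtime probe: true if the lowest-addressed byte of 1 is 1.
bool host_is_little_endian() {
    const std::uint32_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Usage example: both versions must write the same bytes on a
// little-endian host.
void check_equivalence() {
    const std::uint32_t src[2] = {0x11223344u, 0xAABBCCDDu};
    unsigned char bytewise[8] = {};
    unsigned char merged[8] = {};
    serialize_le_bytewise(bytewise, src, 2);
    serialize_le_merged(merged, src, 2);
    assert(bytewise[0] == 0x44 && bytewise[3] == 0x11);
    if (host_is_little_endian()) {
        assert(std::memcmp(bytewise, merged, sizeof bytewise) == 0);
    }
}
```

Building both functions with -O2 -fno-tree-vectorize and comparing the assembly for the two loops is one way to check whether the byte stores in the first version end up merged.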