https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #29 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Mateusz Guzik from comment #28)
> (In reply to H.J. Lu from comment #27)
> > (In reply to Mateusz Guzik from comment #26)
> > > 4 stores per loop is best
> >
> > Do you have data to show it?
>
> I used to, but I'm out of this game.
>
> However, this is what gcc is already emitting if you explicitly ask it for
> unrolled loops, so I don't think this bit should be controversial.

It is hard to believe that 8 stores are slower than a loop.