https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #30 from Mateusz Guzik <mjguzik at gmail dot com> ---
(In reply to H.J. Lu from comment #29)
> (In reply to Mateusz Guzik from comment #28)
> > (In reply to H.J. Lu from comment #27)
> > > (In reply to Mateusz Guzik from comment #26)
> > > > 4 stores per loop is best
> > >
> > > Do you have data to show it?
> >
> > I used to, but I'm out of this game.
> >
> > However, this is what gcc is already emitting if you explicitly ask it for
> > unrolled loops, so I don't think this bit should be controversial.
>
> It is hard to believe 8 stores slower than a loop.

I once more point out that I'm discussing the case of *no* simd usage. With the
example of 128 bytes I provided, that would be 16 stores of 8 bytes each.

I also claim punting to a libcall is ok past 256 bytes, with regular stores
otherwise (no rep prefix). For 256 in particular, fully unrolled, that's 32 mov
instructions. At some point this is an i-cache footprint tradeoff.

Stock gcc already decides to emit loops of 4 stores per iteration if asked to
refrain from using rep movs/stos, so I don't see why anybody would protest
sticking to that specific size.
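
For illustration only, here is a minimal C sketch of the policy described above: a loop doing four 8-byte stores per iteration for sizes up to 256 bytes, and a libcall past that. The names zero_4x8 and zero_buf are hypothetical, and the sketch assumes an 8-byte-aligned buffer. It shows the source-level shape, not what gcc emits: an optimizing compiler may turn the loop back into a memset call or vectorize it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: zero `len` bytes at `p` using four 8-byte stores
 * per loop iteration. Assumes `p` is aligned for uint64_t and `len` is
 * a multiple of 32 (4 stores * 8 bytes). */
static void zero_4x8(void *p, size_t len)
{
    uint64_t *q = p;

    assert(len % 32 == 0);
    for (size_t i = 0; i < len / 8; i += 4) {
        q[i + 0] = 0;
        q[i + 1] = 0;
        q[i + 2] = 0;
        q[i + 3] = 0;
    }
}

/* Hypothetical dispatch: plain stores up to 256 bytes, libcall beyond
 * that. Sizes that are not a multiple of 32 also go to memset here,
 * purely to keep the sketch short. */
static void zero_buf(void *p, size_t len)
{
    if (len > 256 || len % 32 != 0)
        memset(p, 0, len);
    else
        zero_4x8(p, len);
}
```

With the 128-byte example, zero_buf runs 4 iterations of 4 stores, i.e. the same 16 stores, while the loop body itself stays at 4 store instructions instead of 16.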