https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294
--- Comment #31 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Mateusz Guzik from comment #30)
> (In reply to H.J. Lu from comment #29)
> > (In reply to Mateusz Guzik from comment #28)
> > > (In reply to H.J. Lu from comment #27)
> > > > (In reply to Mateusz Guzik from comment #26)
> > > > > 4 stores per loop is best
> > > >
> > > > Do you have data to show it?
> > >
> > > I used to, but I'm out of this game.
> > >
> > > However, this is what gcc is already emitting if you explicitly ask it for
> > > unrolled loops, so I don't think this bit should be controversial.
> >
> > It is hard to believe 8 stores slower than a loop.
>
> I once more point out I'm discussing the case of *no* simd usage.
>
> With the example of 128 bytes I provided, that would be 16 stores.
>
> I also claim punting to libcall is ok past 256 bytes, with regular stores
> otherwise (no rep prefix). For 256 in particular that's 32 mov instructions.
>
> At some point this is an i-cache footprint tradeoff.
>
> Stock gcc already decides to do 4 stores loops if asked to refrain from
> using rep mov/stos, so I don't see why anybody would protest sticking to
> that specific size.

Please describe your suggestions in "stores", not "bytes".
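
For reference, the byte counts quoted above map to store counts as follows, assuming 8-byte general-purpose register stores and no SIMD: 128 bytes is 16 stores and 256 bytes is 32 stores. The C sketch below shows that conversion and the general shape of a loop doing 4 stores per iteration. The helper names `stores_for_bytes` and `zero_4_stores_per_iter` are hypothetical, and this is not GCC's actual expansion; an optimizing compiler is free to merge these stores or turn the loop into a memset call.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: number of 8-byte stores needed to cover `bytes`.
 * Assumes 64-bit scalar stores (no SIMD) and `bytes` a multiple of 8. */
static size_t stores_for_bytes(size_t bytes)
{
    return bytes / 8;
}

/* Hypothetical helper: zero `bytes` bytes using 4 stores of 8 bytes each
 * per loop iteration. Assumes `bytes` is a multiple of 32. Each memcpy of
 * a constant 8 bytes stands in for one scalar 64-bit store. */
static void zero_4_stores_per_iter(void *dst, size_t bytes)
{
    unsigned char *p = dst;
    const uint64_t zero = 0;

    for (size_t off = 0; off < bytes; off += 32) {
        memcpy(p + off,      &zero, 8);  /* store 1 */
        memcpy(p + off + 8,  &zero, 8);  /* store 2 */
        memcpy(p + off + 16, &zero, 8);  /* store 3 */
        memcpy(p + off + 24, &zero, 8);  /* store 4 */
    }
}
```

With this shape, the 128-byte case runs 4 iterations of 4 stores, while a fully unrolled version would emit all 16 stores inline; the difference between the two is the i-cache footprint tradeoff mentioned in the quoted text.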