https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294

--- Comment #31 from H.J. Lu <hjl.tools at gmail dot com> ---
(In reply to Mateusz Guzik from comment #30)
> (In reply to H.J. Lu from comment #29)
> > (In reply to Mateusz Guzik from comment #28)
> > > (In reply to H.J. Lu from comment #27)
> > > > (In reply to Mateusz Guzik from comment #26)
> > > > > 4 stores per loop is best
> > > > 
> > > > Do you have data to show it?
> > > 
> > > I used to, but I'm out of this game.
> > > 
> > > However, this is what gcc is already emitting if you explicitly ask it for
> > > unrolled loops, so I don't think this bit should be controversial.
> > 
> > It is hard to believe 8 stores are slower than a loop.
> 
> I once more point out I'm discussing the case of *no* simd usage.
> 
> With the example of 128 bytes I provided, that would be 16 stores.
> 
> I also claim punting to libcall is ok past 256 bytes, with regular stores
> otherwise (no rep prefix). For 256 in particular that's 32 mov instructions.
> 
> At some point this is an i-cache footprint tradeoff.
> 
> Stock gcc already decides to do 4-store loops if asked to refrain from
> using rep mov/stos, so I don't see why anybody would protest sticking to
> that specific size.

Please describe your suggestions in "stores", not "bytes".
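
For reference, the byte counts quoted above map to store counts as in the rough sketch below. It assumes one "store" means a single 8-byte general-purpose-register mov (no SIMD), which is what the quoted "128 bytes = 16 stores" and "256 bytes = 32 mov instructions" imply. The helper names are hypothetical and this is not what GCC emits; an optimizing compiler is free to turn the loop back into a memset call or vector stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: a "store" is one 8-byte GPR mov; the loop does 4 of them per iteration. */
enum { STORE_BYTES = 8, STORES_PER_ITER = 4 };

/* Number of 8-byte stores needed to cover `bytes` bytes (rounded up). */
static size_t stores_for_bytes(size_t bytes)
{
    return (bytes + STORE_BYTES - 1) / STORE_BYTES;
}

/*
 * Hypothetical helper: zero `bytes` bytes using four 8-byte stores per
 * loop iteration. For simplicity it requires `bytes` to be a multiple of
 * 32; a real expansion would also need to handle the tail.
 */
static void zero_4_stores_per_iter(void *dst, size_t bytes)
{
    unsigned char *p = dst;
    const uint64_t zero = 0;

    assert(bytes % (STORE_BYTES * STORES_PER_ITER) == 0);

    for (size_t i = 0; i < bytes; i += STORE_BYTES * STORES_PER_ITER) {
        /* memcpy of a fixed 8-byte size is the portable way to express
           an unaligned 8-byte store. */
        memcpy(p + i,      &zero, sizeof zero);
        memcpy(p + i + 8,  &zero, sizeof zero);
        memcpy(p + i + 16, &zero, sizeof zero);
        memcpy(p + i + 24, &zero, sizeof zero);
    }
}
```

Under that assumption, stores_for_bytes(128) is 16 and stores_for_bytes(256) is 32, so the 256-byte cutoff corresponds to 8 iterations of the 4-store loop.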
