[Bug target/102294] memset expansion is sometimes slow for small sizes

mjguzik at gmail dot com via Gcc-bugs Fri, 06 Jun 2025 14:34:09 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294


--- Comment #30 from Mateusz Guzik <mjguzik at gmail dot com> ---
(In reply to H.J. Lu from comment #29)
> (In reply to Mateusz Guzik from comment #28)
> > (In reply to H.J. Lu from comment #27)
> > > (In reply to Mateusz Guzik from comment #26)
> > > > 4 stores per loop is best
> > > 
> > > Do you have data to show it?
> > 
> > I used to, but I'm out of this game.
> > 
> > However, this is what gcc is already emitting if you explicitly ask it for
> > unrolled loops, so I don't think this bit should be controversial.
> 
> It is hard to believe 8 stores are slower than a loop.

Once more: I'm discussing the case of *no* SIMD usage.

With the 128-byte example I provided, that would be 16 stores (8 bytes each).

I also claim that punting to a libcall is OK past 256 bytes, with regular
stores (no rep prefix) up to that size. For 256 bytes in particular, that's
32 mov instructions.

At some point this is an i-cache footprint tradeoff.

Stock gcc already emits loops of 4 stores when asked to refrain from using
rep movs/stos, so I don't see why anybody would object to sticking with that
specific size.
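
To make the shape concrete, here is a rough sketch in C of what is being
argued for: 8-byte scalar stores, 4 per loop iteration, with a libcall past
256 bytes. The names zero_4x8, zero_small and scalar_stores are made up for
illustration, and the multiple-of-32 restriction is an assumption to keep the
sketch short. This is not what gcc literally emits, and an optimizing compiler
is free to turn the loop back into a memset call or SIMD stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of 8-byte scalar stores needed to cover len bytes
   (assumes len is a multiple of 8). */
static size_t scalar_stores(size_t len)
{
    return len / 8;
}

/* Hypothetical helper: zero len bytes using 8-byte stores, 4 per
   iteration (32 bytes per trip). Assumes len is a multiple of 32. */
static void zero_4x8(void *dst, size_t len)
{
    unsigned char *p = dst;
    const uint64_t z = 0;

    for (size_t off = 0; off < len; off += 32) {
        /* memcpy of a fixed 8 bytes stands in for one unaligned
           64-bit mov store without breaking aliasing rules. */
        memcpy(p + off,      &z, sizeof z);
        memcpy(p + off + 8,  &z, sizeof z);
        memcpy(p + off + 16, &z, sizeof z);
        memcpy(p + off + 24, &z, sizeof z);
    }
}

/* Hypothetical dispatch: inline stores up to 256 bytes, libcall
   past that (or when the size does not fit the sketch). */
static void zero_small(void *dst, size_t len)
{
    if (len > 256 || len % 32 != 0) {
        memset(dst, 0, len);
        return;
    }
    zero_4x8(dst, len);
}
```

For 128 bytes the loop runs 4 times for 16 stores in total; for 256 bytes it
runs 8 times for 32 stores, while the code size stays at 4 stores plus the
loop overhead.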
