On Tue, Mar 19, 2024 at 9:03 AM Nathan Bossart <nathandboss...@gmail.com> wrote:
>
> On Sun, Mar 17, 2024 at 09:47:33AM +0700, John Naylor wrote:
> > I haven't looked at the patches, but the graphs look good.
>
> I spent some more time on these patches. Specifically, I reordered them to
> demonstrate the effects on systems without AVX2 support. I've also added a
> shortcut to jump to the one-by-one approach when there aren't many
> elements, as the overhead becomes quite noticeable otherwise. Finally, I
> ran the same benchmarks again on x86 and Arm out to 128 elements.
>
> Overall, I think 0001 and 0002 are in decent shape, although I'm wondering
> if it's possible to improve the style a bit.

I took a brief look, and 0001 isn't quite what I had in mind. I can't
quite tell what it's doing with the additional branches and "goto
retry", but I meant something pretty simple:

- if short, do one element at a time and return
- if long, do one block unconditionally, then round the start pointer
  up so that "end - start" is an exact multiple of blocks, and loop over
  them
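
To make the two cases concrete, here is a rough sketch in plain C. It is
not taken from any of the patches: the names lfind32_sketch,
block_contains and BLOCK_ELEMS are made up for illustration, the block
size of 16 is an arbitrary assumption, and the scalar loop in
block_contains only stands in for whatever vector comparison the real
code would use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block size; real code would derive it from the vector width. */
#define BLOCK_ELEMS 16

/* Scalar stand-in for a vectorized comparison of one full block. */
static bool
block_contains(uint32_t key, const uint32_t *block)
{
    bool        found = false;

    for (size_t i = 0; i < BLOCK_ELEMS; i++)
        found |= (block[i] == key);
    return found;
}

/* Hypothetical name; returns true if key appears in base[0 .. nelem - 1]. */
static bool
lfind32_sketch(uint32_t key, const uint32_t *base, size_t nelem)
{
    size_t      i;

    /* Short input: do one element at a time and return. */
    if (nelem < BLOCK_ELEMS)
    {
        for (i = 0; i < nelem; i++)
        {
            if (base[i] == key)
                return true;
        }
        return false;
    }

    /* Long input: do the first block unconditionally. */
    if (block_contains(key, base))
        return true;

    /*
     * Round the start up so that (nelem - i) is an exact multiple of
     * BLOCK_ELEMS.  The next block may overlap the first one, which only
     * costs a few repeated comparisons.
     */
    i = nelem % BLOCK_ELEMS;
    if (i == 0)
        i = BLOCK_ELEMS;

    /* Loop over the remaining whole blocks; no tail handling is needed. */
    for (; i < nelem; i += BLOCK_ELEMS)
    {
        if (block_contains(key, &base[i]))
            return true;
    }
    return false;
}
```

With 17 elements, for example, this checks elements 0..15 and then
1..16, so the last element is covered without a separate
one-at-a-time tail loop.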