On Thu, Apr 04, 2024 at 04:28:58PM +1300, David Rowley wrote:
> On Thu, 4 Apr 2024 at 11:50, Nathan Bossart <nathandboss...@gmail.com> wrote:
>> If we can verify this approach won't cause segfaults and can stomach the
>> regression between 8 and 16 bytes, I'd happily pivot to this approach so
>> that we can avoid the function call dance that I have in v25.
>
> If we're worried about regressions with some narrow range of byte
> values, wouldn't it make more sense to compare that to cc4826dd5~1 at
> the latest rather than to some version that's already probably faster
> than PG16?

Good point.  When compared with REL_16_STABLE, Ants's idea still wins:

  bytes  v25       v25+ants  REL_16_STABLE
      2  1108.205  1033.132       2039.342
      4  1311.227  1289.373       3207.217
      8  1927.954  2360.113       3200.238
     16  2281.091  2365.408       4457.769
     32  3856.992  2390.688       6206.689
     64  3648.72   3242.498       9619.403
    128  4108.549  3607.148      17912.081
    256  4910.076  4496.852      33591.385

As before, with 2 and 4 bytes, HEAD is using the inlined approach, but
REL_16_STABLE is doing a function call.  For 8 bytes, REL_16_STABLE is
doing a function call as well as a call to a function pointer.  At 16
bytes, it's doing a function call and two calls to a function pointer.
With Ants's approach, both 8 and 16 bytes require a single call to a
function pointer, and of course we are using the AVX-512 implementation
for both.

I think this is sufficient to justify switching approaches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com