Re: Popcount optimization using AVX512

Nathan Bossart Thu, 04 Apr 2024 10:18:45 -0700

On Thu, Apr 04, 2024 at 04:28:58PM +1300, David Rowley wrote:
> On Thu, 4 Apr 2024 at 11:50, Nathan Bossart <nathandboss...@gmail.com> wrote:
>> If we can verify this approach won't cause segfaults and can stomach the
>> regression between 8 and 16 bytes, I'd happily pivot to this approach so
>> that we can avoid the function call dance that I have in v25.
> 
> If we're worried about regressions with some narrow range of byte
> values, wouldn't it make more sense to compare that to cc4826dd5~1 at
> the latest rather than to some version that's already probably faster
> than PG16?


Good point.  When compared with REL_16_STABLE, Ants's idea still wins:

  bytes  v25       v25+ants  REL_16_STABLE
      2  1108.205  1033.132  2039.342
      4  1311.227  1289.373  3207.217
      8  1927.954  2360.113  3200.238
     16  2281.091  2365.408  4457.769
     32  3856.992  2390.688  6206.689
     64  3648.72   3242.498  9619.403
    128  4108.549  3607.148  17912.081
    256  4910.076  4496.852  33591.385

As before, with 2 and 4 bytes, HEAD is using the inlined approach, but
REL_16_STABLE is doing a function call.  For 8 bytes, REL_16_STABLE is
doing a function call as well as a call to a function pointer.  At 16
bytes, it's doing a function call and two calls to a function pointer.
With Ants's approach, both 8 and 16 bytes require a single call to a
function pointer, and of course we are using the AVX-512 implementation for
both.
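
For anyone following along, here is a rough sketch of the dispatch shape
described above: small inputs are counted inline, and everything else goes
through a single function pointer.  This is not the code from v25 or from
Ants's patch.  The names popcount_dispatch, popcount_portable,
popcount_optimized, and INLINE_THRESHOLD are made up for illustration, and
the portable loop merely stands in for whichever implementation (e.g., the
AVX-512 one) would be chosen at startup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff: inputs shorter than this are counted inline. */
#define INLINE_THRESHOLD 8

/* Count the set bits in one byte by clearing the lowest set bit each pass. */
static inline uint64_t
popcount_byte(unsigned char b)
{
	uint64_t	count = 0;

	while (b)
	{
		b &= (unsigned char) (b - 1);
		count++;
	}
	return count;
}

/* Portable fallback; a stand-in for the implementation picked at startup. */
static uint64_t
popcount_portable(const char *buf, size_t bytes)
{
	uint64_t	count = 0;

	while (bytes--)
		count += popcount_byte((unsigned char) *buf++);
	return count;
}

/*
 * Assumption: a real build would set this pointer once, after CPU feature
 * detection.  Here it always points at the portable version.
 */
static uint64_t (*popcount_optimized) (const char *buf, size_t bytes) =
	popcount_portable;

/* Inline for tiny inputs, otherwise exactly one call through the pointer. */
static inline uint64_t
popcount_dispatch(const char *buf, size_t bytes)
{
	assert(buf != NULL || bytes == 0);

	if (bytes < INLINE_THRESHOLD)
	{
		uint64_t	count = 0;

		while (bytes--)
			count += popcount_byte((unsigned char) *buf++);
		return count;
	}

	return popcount_optimized(buf, bytes);
}
```

With a shape like this, the 2- and 4-byte cases never leave the caller, and
the 8- and 16-byte cases each cost one indirect call, which matches the
call counts listed above.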

I think this is sufficient to justify switching approaches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
