On Sat, Mar 22, 2025 at 10:42 AM Nathan Bossart <nathandboss...@gmail.com> wrote:
> * 0002 introduces the Neon implementation, which conveniently doesn't need > configure-time checks or function pointers. I noticed that some > compilers (e.g., Apple clang 16) compile in Neon instructions already, > but our hand-rolled implementation is better about instruction-level > parallelism and seems to still be quite a bit faster. +pg_popcount64(uint64 word) +{ + return vaddv_u8(vcnt_u8(vld1_u8((const uint8 *) &word))); +} This confused me until I found that this is what __builtin_popcountl(word) would emit anyway. Worth a comment? Some thoughts to consider, some speculative and maybe not worth putting time into: > I did add a 2-register block > in the Neon implementation for processing the tail because I was worried > about its performance on smaller buffers, but that part might get removed > if I can't measure any difference. Even if we can measure a difference on fixed-sized inputs, that might not carry over when the branch is unpredictable. + /* + * Process remaining 8-byte blocks. + */ + for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64)) + { + popcnt += pg_popcount64(*((uint64 *) buf)); + buf += sizeof(uint64); + } This uses 16-byte registers, but only loads 8-bytes at a time (with accumulation work), then a bytewise tail up to 7 bytes. Alternatively, you could instead do a loop over a single local accumulator, which I think could have a short accumulation pipeline since 3 iterations can't overflow 8-bit lanes. But then the bytewise tail could be up to 15 bytes. > * 0003 introduces the SVE implementation. You'll notice I've moved all the > function pointer gymnastics into the pg_popcount_aarch64.c file, which is > where the Neon implementations live, too. I also tried to clean up the > configure checks a bit. I imagine it's possible to make them more > compact, but I felt that the enhanced readability was worth it. I don't know what the configure checks looked like before, but I'm confused that the loops are unrolled in the link-test functions as well. > * For both Neon and SVE, I do see improvements with looping over 4 > registers at a time, so IMHO it's worth doing so even if it performs the > same as 2-register blocks on some hardware. I wonder if alignment matters for these larger blocks. -- John Naylor Amazon Web Services