>>> I'm not against continuing with the more well-known approach for now
>>> but we should keep in mind that there might still be potential for
>>> improvement.
>
> No. I don't think it's faster.

I did a quick check on my x86 laptop and it's roughly 25% faster there.
That's consistent with the literature.  RISC-V qemu only shows a 5-10%
improvement, though.

> I have no idea.  I saw ARM SVE generate:
> POP_COUNT
> POP_COUNT
> VEC_PACK_TRUNC.

I strongly suspect this happens because the result is converted to int.
If you change dst to uint64_t there won't be any vec_pack_trunc.

> I am gonna drop this patch since it's meaningless.

But why?  It can still help even if we can improve on the sequence.
IMHO you can go ahead with it and just change int -> uint64_t in the tests.

Regards
 Robin
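
P.S. To illustrate the int vs. uint64_t point, here is a minimal sketch of the two loop shapes I have in mind.  The function names are made up for illustration and this is not the actual test from the patch; it only assumes a 64-bit source array and GCC's __builtin_popcountll.  With an int destination the 64-bit popcount results have to be narrowed, which is presumably where the pack/truncate step comes from; with a uint64_t destination source and destination elements have the same width.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical narrowing variant: each 64-bit popcount result is stored
   into a 32-bit int, so a vectorizer has to narrow the result vectors.  */
void
popcount_narrow (int *restrict dst, const uint64_t *restrict src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = __builtin_popcountll (src[i]);
}

/* Hypothetical same-width variant: destination elements are as wide as
   the source elements, so no narrowing is needed.  */
void
popcount_same_width (uint64_t *restrict dst, const uint64_t *restrict src,
                     size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = __builtin_popcountll (src[i]);
}
```

Both functions compute the same values; only the element width of dst differs, so comparing the vectorized code for the two should show whether the extra step goes away.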