>> From my recollection this is usually 30-40% faster than the naive tree >> adder and also amenable to vectorization. As long as the multiplication >> is not terribly slow, that is. Mula's algorithm should be significantly >> faster even, another 30% IIRC. >> I'm not against continuing with the more well-known approach for now >> but we should keep in mind that might still be potential for improvement.
No. I don't think it's faster. >> Wait, why do we need vec_pack_trunc for popcountll? For me vectorizing >> it "just works" when the output is a uint64_t just like the standard >> name demands. >> If you're referring to something else, please detail in the comment. I have no ideal. I saw ARM SVE generate: POP_COUNT POP_COUNT VEC_PACK_TRUNC. I am gonna drop this patch since it's meaningless. Thanks. juzhe.zh...@rivai.ai From: Robin Dapp Date: 2023-08-01 03:38 To: Juzhe-Zhong; gcc-patches CC: rdapp.gcc; kito.cheng; kito.cheng Subject: Re: [PATCH V2] RISC-V: Support POPCOUNT auto-vectorization Hi Juzhe, > +/* Expand Vector POPCOUNT by parallel popcnt: > + > + int parallel_popcnt(uint32_t n) { > + #define POW2(c) (1U << (c)) > + #define MASK(c) (static_cast<uint32_t>(-1) / (POW2(POW2(c)) + 1U)) > + #define COUNT(x, c) ((x) & MASK(c)) + (((x)>>(POW2(c))) & MASK(c)) > + n = COUNT(n, 0); > + n = COUNT(n, 1); > + n = COUNT(n, 2); > + n = COUNT(n, 3); > + n = COUNT(n, 4); > + // n = COUNT(n, 5); // uncomment this line for 64-bit integers > + return n; > + #undef COUNT > + #undef MASK > + #undef POW2 > + } That's quite a heavy implementation but I suppose with the proper cost function it can still be worth it. Did you also try some alternatives? WWG comes to mind: uint64_t c1 = 0x5555555555555555; uint64_t c2 = 0x3333333333333333; uint64_t c4 = 0x0F0F0F0F0F0F0F0F; uint64_t wwg (uint64_t x) { x -= (x >> 1) & c1; x = ((x >> 2) & c2) + (x & c2); x = (x + (x >> 4) ) & c4; x *= 0x0101010101010101; return x >> 56; } From my recollection this is usually 30-40% faster than the naive tree adder and also amenable to vectorization. As long as the multiplication is not terribly slow, that is. Mula's algorithm should be significantly faster even, another 30% IIRC. I'm not against continuing with the more well-known approach for now but we should keep in mind that might still be potential for improvement. > } // namespace riscv_vector > diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/widen/popcount-1.c > b/gcc/testsuite/gcc.target/riscv/rvv/autovec/widen/popcount-1.c Any particular reason why the tests are in widen? > +extern void abort (void) __attribute__ ((noreturn)); Why no __builtin_unreachable as in the other tests? > + asm volatile ("" ::: "memory"); Is this necessary? I doesn't hurt of course, just wondering. All in all LGTM in case you'd rather get this upstream now. We can always improve later. Regards Robin