Re: Re: [PATCH V2] RISC-V: Support POPCOUNT auto-vectorization

钟居哲 Mon, 31 Jul 2023 15:03:33 -0700

>> From my recollection this is usually 30-40% faster than the naive tree
>> adder and also amenable to vectorization.  As long as the multiplication
>> is not terribly slow, that is.  Mula's algorithm should be significantly
>> faster even, another 30% IIRC.
 
>> I'm not against continuing with the more well-known approach for now
>> but we should keep in mind that might still be potential for improvement.


No. I don't think it's faster.

>> Wait, why do we need vec_pack_trunc for popcountll?  For me vectorizing
>> it "just works" when the output is a uint64_t just like the standard
>> name demands.

>> If you're referring to something else, please detail in the comment.

I have no ideal. I saw ARM SVE generate:
POP_COUNT
POP_COUNT
VEC_PACK_TRUNC.

I am gonna drop this patch since it's meaningless.

Thanks.


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-08-01 03:38
To: Juzhe-Zhong; gcc-patches
CC: rdapp.gcc; kito.cheng; kito.cheng
Subject: Re: [PATCH V2] RISC-V: Support POPCOUNT auto-vectorization
Hi Juzhe,
 
> +/* Expand Vector POPCOUNT by parallel popcnt:
> +
> +   int parallel_popcnt(uint32_t n) {
> +   #define POW2(c)      (1U << (c))
> +   #define MASK(c)      (static_cast<uint32_t>(-1) / (POW2(POW2(c)) + 1U))
> +   #define COUNT(x, c)  ((x) & MASK(c)) + (((x)>>(POW2(c))) & MASK(c))
> + n = COUNT(n, 0);
> + n = COUNT(n, 1);
> + n = COUNT(n, 2);
> + n = COUNT(n, 3);
> + n = COUNT(n, 4);
> +   // n = COUNT(n, 5);  // uncomment this line for 64-bit integers
> + return n;
> +   #undef COUNT
> +   #undef MASK
> +   #undef POW2
> +   }
 
That's quite a heavy implementation but I suppose with the proper cost
function it can still be worth it.  Did you also try some alternatives?
WWG comes to mind:
 
uint64_t c1 = 0x5555555555555555;
uint64_t c2 = 0x3333333333333333;
uint64_t c4 = 0x0F0F0F0F0F0F0F0F;
 
uint64_t wwg (uint64_t x) {
    x -= (x >> 1) & c1;
    x = ((x >> 2) & c2) + (x & c2);
    x = (x + (x >> 4) ) & c4;
    x *= 0x0101010101010101;
    return x >> 56;
}
 
From my recollection this is usually 30-40% faster than the naive tree
adder and also amenable to vectorization.  As long as the multiplication
is not terribly slow, that is.  Mula's algorithm should be significantly
faster even, another 30% IIRC.
 
I'm not against continuing with the more well-known approach for now
but we should keep in mind that might still be potential for improvement.
 
>  } // namespace riscv_vector
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/widen/popcount-1.c 
> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/widen/popcount-1.c
 
Any particular reason why the tests are in widen?
 
> +extern void abort (void) __attribute__ ((noreturn));
 
Why no __builtin_unreachable as in the other tests? 
 
> +      asm volatile ("" ::: "memory");
 
Is this necessary?  I doesn't hurt of course, just wondering.
 
All in all LGTM in case you'd rather get this upstream now.  We can
always improve later.
 
Regards
Robin

Re: Re: [PATCH V2] RISC-V: Support POPCOUNT auto-vectorization

Reply via email to