https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011
--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
And to answer myself: x86 has vplzcnt* only for 32-bit and 64-bit elts, with -mavx512cd (perhaps also -mavx512vl, depending on the vector size), but there is also vector popcount for 8-bit and 16-bit elements (guarded by different options). And with popcount it would be 3 instructions instead of 4, though dunno about their latencies etc.
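
For illustration only, here is a scalar sketch of the two ways a count-trailing-zeros could be expressed, one through popcount and one through a leading-zero count. The comment does not spell out the exact sequences, so the identities below and the helper names (ctz32_via_popcount, ctz32_via_clz) are assumptions chosen for this example, not necessarily what the vectorizer would emit. Under that reading, the popcount form needs a subtract, an and-not and a popcount, while the clz form needs a negate, an and, a leading-zero count and a final subtract, which would match the "3 instructions instead of 4" estimate.

```c
#include <stdint.h>

/* Assumed identity: the bits below the lowest set bit of x are exactly
   ~x & (x - 1), so counting them gives the number of trailing zeros.
   For x == 0 the mask is all ones and the result is 32. */
static inline unsigned ctz32_via_popcount(uint32_t x)
{
    return (unsigned)__builtin_popcount(~x & (x - 1u));
}

/* Assumed identity: x & -x isolates the lowest set bit, and its bit index
   is 31 - clz. __builtin_clz is undefined for 0, so x must be nonzero. */
static inline unsigned ctz32_via_clz(uint32_t x)
{
    uint32_t lowest = x & (0u - x);
    return 31u - (unsigned)__builtin_clz(lowest);
}
```

Both helpers agree with __builtin_ctz for nonzero inputs; only the popcount form has a defined result (32) for zero.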