https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528
--- Comment #6 from rguenther at suse dot de <rguenther at suse dot de> --- On Mon, 8 Oct 2018, jamborm at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87528 > > --- Comment #5 from Martin Jambor <jamborm at gcc dot gnu.org> --- > (In reply to Richard Biener from comment #3) > > Can you point me to the source for which we generate the popcount call(s)? > > It might be not final value replacement but instead code-generating a niter > > analysis result. > > It turns out it is as simple as: > > ---------------------------------------- > typedef unsigned long long BITBOARD; > > int PopCount (BITBOARD b) { > int c = 0; > > while (b) { > b &= b - 1; > c++; > } > > return c; > } > ---------------------------------------- > with -O3. Trunk generates a call to __popcountdi2 when GCC 7 (or > trunk with -march=skylake) does not. OK, I see. They have that PopCount and also ThickPopCount which appearantly is supposed to handle the high-density case better. They have also open-coded ffs() & friends. I guess without any idea of 'b' we need to assume that 'b' has almost no bits set and thus the loop will be fast(er than a call). This was also noted elsewhere and a GCC discussion circled around the fact that we introduce popcount() into other contexts (the above is final value replacement). So that raises the question whether we should simply open-code popcount on the target? I wonder if you can benchmark using ThickPopCount everywhere (for a non-loopy replacement). Note currently final value replacement uses expression_expensive_p which eventually just checks is_inexpensive_builtin which suffers from the very same issue (in theory). Patching is_inexpensive_builtin to check whether we have an optab might do the trick (with other side-effects, like on inlining, of course).