The current __builtin_popcountll (and likely __builtin_popcount) are fairly slow as compared to a simple, short C version derived from what can be found in Knuth's recent publications. The following short function is about 3x as fast as the __builtin version, which runs counter to the idea that __builtin_XXX provides access to implementations that are exemplars for a given platform.
unsigned int popcount64(unsigned long long x) { x = (x & 0x5555555555555555ULL) + ((x >> 1) & 0x5555555555555555ULL); x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); x = (x & 0x0F0F0F0F0F0F0F0FULL) + ((x >> 4) & 0x0F0F0F0F0F0F0F0FULL); return (x * 0x0101010101010101ULL) >> 56; } This version has the additional benefit that it omits the lookup table that the current "builtin" version uses. I measured the above function vs. __builtin_popcountll with a loop like the following: t1 = clock(); for (j = 0; j < 1000000; j++) for (i = 0; i < 1024; i++) pt = popcount64(data[i]); t2 = clock(); printf("popcount64 = %d clocks\n", t2 - t1); ...where data[] is a u64 that's preinitialized. I'll attach the exact source I used, which also includes two other possible implementations of popcountll. -- Summary: Speed up builtin_popcountll Product: gcc Version: 4.2.0 Status: UNCONFIRMED Severity: enhancement Priority: P3 Component: c AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: intvnut at gmail dot com GCC build triplet: x86_64-unknown-linux-gnu GCC host triplet: x86_64-unknown-linux-gnu GCC target triplet: x86_64-unknown-linux-gnu http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041