On Dec 5, 2013, at 4:44 PM, Ben Pfaff <b...@nicira.com> wrote: > On Thu, Dec 05, 2013 at 04:36:26PM -0800, Jarno Rajahalme wrote: >> Inline, use another well-known algorithm for 64-bit builds, and use >> builtins when they are known to be fast at compile time. A 32-bit >> version of the alternate algorithm is slower than the existing >> implementation, so the old one is used for 32-bit builds. Inline >> assembler would be a bit faster on 32-bit i7 build, but we use the GCC >> builtin for portability. >> >> It should be stressed builds for specific CPUs do not work on others >> CPUs, and that OVS build system or runtime does not currently support >> CPU detection. >> >> Speed improvement v.s. existing implementation / GCC 4.7 >> __builtin_popcountll(): >> >> i386: 64% (inlining) / 380% >> i386 on i7: 240% (inlining + builtin) / 820% >> x86_64: 59% (inlining + different algorithm) / 190% >> x86_64 on i7: 370% (inlining + builtin) / 0% >> >> Signed-off-by: Jarno Rajahalme <jrajaha...@nicira.com> > > Wow. > > How did you measure the benefit of inlining?
With a standalone C program running different variants (original, inlined, different algorithm) over randomized data of 10000 u64s about 100000 times and timing each algorithm with clock_gettime(CLOCK_THREAD_CPUTIME_ID). Compiled for 32 or 64 bits with -m32 or without, and for i7 with -march=native -mtune=native. I used -O2 as we compile OVS with -O2, too. The alternate algorithm computes the popcount in parallel for the 64 bits (so it is equally fast for both 32 and 64 bits), while the recalculated array requires double the work for 64 bits, compared to 32 bits. But on the other hand, the array lookup is still faster for 32 bits. Jarno
_______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev