Hi Jarno,

I've read your patch "better count1_bits", and I test the gcc
builtins separately.

Call __builtin_popcount|__builtin_popcountl|__builtin_popcountll 10 million 
times
--------------------------------------
    suse-kvm-of13:/test # time ./bit4

    real    0m0.034s
    user    0m0.032s
    sys     0m0.000s

Call count1_bits 10 million times
--------------------------------------
    suse-kvm-of13:/test # time ./bit1

    real    0m0.080s
    user    0m0.076s
    sys     0m0.000s

Looks good, but I've a problem below.
My cpuinfo: 16U * Intel(R) Xeon(R) CPU E5620 @ 2.40GHz. (westmere)
I've read gcc source, find M_INTEL_COREI7_WESTMERE, it seems
to say westmere is corei7, but the following code doesn't work:

    #if defined(__corei7)
        int i;
        for (i = 0; i < 10000000; i++)
            __builtin_popcount(i);
    #endif

I believe there're some particuler cpus which the buildin_popcount
is suitable for, any way to represent them?

On 06/12/2013 12:26, Ben Pfaff wrote:
On Fri, Dec 06, 2013 at 12:02:59PM +0800, Alexander Wu wrote:
Some additional test for bitmap performance (10 million loops):

     suse-kvm-of13:/test # time ./bit3

     real    0m0.061s
     user    0m0.060s
     sys     0m0.000s
     suse-kvm-of13:/test # time ./bit2

     real    0m0.088s
     user    0m0.084s
     sys     0m0.000s
     suse-kvm-of13:/test # time ./bit1

     real    0m0.080s
     user    0m0.072s
     sys     0m0.004s

[bit1]: count1_bits
[bit2]: NumberOfSetBits from http://stackoverflow.com/a/109025
[bit3]: precomputed16_bitcount from 
http://gurmeet.net/puzzles/fast-bit-counting-routines/bitcount.c

After reading the implements, I understand how precomputed16_bitcount wins.
count_1bits_32 inits 256 bytes array, while precomputed16_bitcount
inits 65536 bytes array and reduce the calculation:

count1_bits:
     return (count_1bits_8[x & 0xff] +
             count_1bits_8[(x >> 8) & 0xff] +
             count_1bits_8[(x >> 16) & 0xff] +
             count_1bits_8[x >> 24]);

precomputed16_bitcount:
     return bits_in_16bits [n         & 0xffffu]
         +  bits_in_16bits [(n >> 16) & 0xffffu] ;

So how about change to the fastest one?

Please talk to Jarno, who has been looking at a related change.

But I'm inclined to believe that a 65536-byte array wastes too much
memory.






_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Reply via email to