On Dec 8, 2013, at 7:34 PM, Alexander Wu <alexander...@huawei.com> wrote:
> Hi Jarno,
>
> I get my gcc predefined __core2. But its performance seems to be worse when
> I add '-O2'. Not sure if it's the reality.
>

From the numbers below it seems that performance is better with -O2
(1063893 < 1317450), so I’m not sure what you mean here.

> Here are part of my test code, compile command and its result.
>
> Code:
>
> uint32_t i, last_bits;
> struct timespec start = {0};
> struct timespec end = {0};
> srand(time(NULL));
> int r = rand();
> #define N_LOOP 100000
> int random_array[N_LOOP];
>
> srand(time(NULL));
> for (i = 0; i < N_LOOP; i++) {
>     r = rand();
>     random_array[i] = r;
> }
>
> //__builtin_popcount
> clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
> for (i = 0; i < N_LOOP; i++) {
>     last_bits = __builtin_popcount(random_array[i]);
> }
> clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
> printf("time-diff:%ld\n", end.tv_nsec - start.tv_nsec);
> printf("last-bits:%d\n", last_bits);
>
> //original ovs count_1bits_32
> clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
> for (i = 0; i < N_LOOP; i++) {
>     last_bits = count_1bits_32(random_array[i]);
> }
> clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
> printf("time-diff:%ld\n", end.tv_nsec - start.tv_nsec);
> printf("last-bits:%d\n", last_bits);
>
> //simple foo function, to count '=' and function time.
> clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
> for (i = 0; i < N_LOOP; i++) {
>     last_bits = foo();
> }
> clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
> printf("time-diff:%ld\n", end.tv_nsec - start.tv_nsec);
> printf("last-bits:%d\n", last_bits);
>
> Compile:
> gcc bit1.c -o bit1 -march=native -mtune=native -lrt -O2 && ./bit1
>
> Result:
>
> time-diff:1063893 //__builtin_popcount
> last-bits:10
> time-diff:293463 //original ovs count_1bits_32
> last-bits:10
> time-diff:188 //simple foo function, to count '=' and function
> time.(maybe it has been optimized out)
> last-bits:99999
>
> Result without -O2:
>
> time-diff:1317450
> last-bits:10
> time-diff:991438
> last-bits:10
> time-diff:416265
> last-bits:99999
>
>
> Note I use last_bits to restore the return value, and when I use it,
> performance of __builtin_popcount seems to decrease, I guess compiler
> optimize __builtin_popcount as its wish like -O2.

You could prevent optimizations by adding instead of simply assigning, (i.e., “last_bits += …”).

>
> So do you think it's enough to represent __builtin_popcount is not
> suitable for __core2?
>

Seems so, and it also makes sense as Core2 does not have the popcnt instruction.

  Jarno

_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev
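
For reference, here is a minimal sketch of the “last_bits += …” suggestion from the reply above. It is only an illustration under stated assumptions, not the test program from this thread: the names popcount_sw, elapsed_ns, time_builtin and time_sw are hypothetical, and popcount_sw is a generic bit-trick popcount standing in for a software implementation, not the OVS count_1bits_32. The sketch also derives the elapsed time from both tv_sec and tv_nsec, since a difference of tv_nsec alone can be wrong when the two samples straddle a second boundary.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical software popcount (generic bit trick).
 * This is NOT the OVS count_1bits_32 implementation. */
static uint32_t popcount_sw(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0f0f0f0fu;
    return (x * 0x01010101u) >> 24;
}

/* Elapsed nanoseconds between two samples, using both fields. */
static int64_t elapsed_ns(const struct timespec *start,
                          const struct timespec *end)
{
    return (int64_t) (end->tv_sec - start->tv_sec) * 1000000000LL
           + (int64_t) (end->tv_nsec - start->tv_nsec);
}

/* Times __builtin_popcount over a[0..n-1].  The results are summed
 * ("+=") so that every iteration contributes to the output. */
static int64_t time_builtin(const uint32_t *a, size_t n, uint32_t *sum)
{
    struct timespec start, end;
    uint32_t total = 0;
    size_t i;

    assert(a != NULL && sum != NULL);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
    for (i = 0; i < n; i++) {
        total += (uint32_t) __builtin_popcount(a[i]);
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
    *sum = total;
    return elapsed_ns(&start, &end);
}

/* Same loop, using the hypothetical software popcount. */
static int64_t time_sw(const uint32_t *a, size_t n, uint32_t *sum)
{
    struct timespec start, end;
    uint32_t total = 0;
    size_t i;

    assert(a != NULL && sum != NULL);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
    for (i = 0; i < n; i++) {
        total += popcount_sw(a[i]);
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
    *sum = total;
    return elapsed_ns(&start, &end);
}
```

Both functions would be called on the same array of random values, and the returned time printed together with the sum. Printing the sum keeps the loop result live, and the two sums should match. Even so, at -O2 the compiler may still vectorize or otherwise transform the loops, so the numbers are best read as a rough comparison only.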