On 10/12/2013 04:50, Jarno Rajahalme wrote:
On Dec 8, 2013, at 7:34 PM, Alexander Wu <alexander...@huawei.com> wrote:
Hi Jarno,
My gcc predefines __core2. But the performance of __builtin_popcount seems to
be worse when I add '-O2'. I'm not sure whether that reflects reality.
From the numbers below it seems that performance is better with -O2 (1063893 <
1317450), so I’m not sure what you mean here.
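I mean the ratio relative to count_1bits_32: with -O2 it is 1063893/293463
(about 3.6), without -O2 it is 1317450/991438 (about 1.3), so relative to
count_1bits_32 it seems worse.
Here is part of my test code, the compile command, and its results.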
Code:
uint32_t i, last_bits;
struct timespec start = {0};
struct timespec end = {0};
srand(time(NULL));
int r = rand();
#define N_LOOP 100000
int random_array[N_LOOP];
srand(time(NULL));
for (i = 0; i < N_LOOP; i++) {
    r = rand();
    random_array[i] = r;
}
// __builtin_popcount
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
for (i = 0; i < N_LOOP; i++) {
    last_bits = __builtin_popcount(random_array[i]);
}
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
// Include tv_sec so the difference stays correct across a second boundary.
printf("time-diff:%ld\n", (end.tv_sec - start.tv_sec) * 1000000000L
                          + (end.tv_nsec - start.tv_nsec));
printf("last-bits:%u\n", last_bits);
// original ovs count_1bits_32
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
for (i = 0; i < N_LOOP; i++) {
    last_bits = count_1bits_32(random_array[i]);
}
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
printf("time-diff:%ld\n", (end.tv_sec - start.tv_sec) * 1000000000L
                          + (end.tv_nsec - start.tv_nsec));
printf("last-bits:%u\n", last_bits);
// trivial foo() function, to measure the cost of '=' and of the function call
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
for (i = 0; i < N_LOOP; i++) {
    last_bits = foo();
}
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
printf("time-diff:%ld\n", (end.tv_sec - start.tv_sec) * 1000000000L
                          + (end.tv_nsec - start.tv_nsec));
printf("last-bits:%u\n", last_bits);
Compile:
gcc bit1.c -o bit1 -march=native -mtune=native -lrt -O2 && ./bit1
Result:
time-diff:1063893 //__builtin_popcount
last-bits:10
time-diff:293463 //original ovs count_1bits_32
last-bits:10
time-diff:188 //simple foo function, to measure '=' and the function call (maybe it has been optimized out)
last-bits:99999
Result without -O2:
time-diff:1317450
last-bits:10
time-diff:991438
last-bits:10
time-diff:416265
last-bits:99999
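A note on the time-diff values: a difference of two struct timespec samples needs both tv_sec and tv_nsec, otherwise a measurement that crosses a second boundary comes out wrong. A minimal sketch of such a helper follows; the name elapsed_ns is hypothetical and not part of the test program above.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper (not in the original test code): elapsed time in
 * nanoseconds between two samples, using both tv_sec and tv_nsec. */
static int64_t
elapsed_ns(const struct timespec *start, const struct timespec *end)
{
    return (int64_t) (end->tv_sec - start->tv_sec) * 1000000000LL
           + (int64_t) (end->tv_nsec - start->tv_nsec);
}
```

It would be called as elapsed_ns(&start, &end) after the second clock_gettime() and printed with "%lld" after a cast to long long.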
Note that I use last_bits to store the return value. When I actually use it,
the performance of __builtin_popcount seems to decrease; I guess that otherwise
the compiler optimizes the __builtin_popcount calls as it sees fit, as with -O2.
You could keep the compiler from optimizing the loop away by adding instead of
simply assigning (i.e., “last_bits += …”).
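The accumulation idea can be sketched as below. Because every iteration contributes to a total that is returned (and later printed), the compiler cannot drop all but the last call. The function name sum_popcount is an assumption for illustration only; the compiler may still transform the loop in other ways.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical benchmark body: accumulate the bit counts instead of
 * overwriting a single variable, so each call affects the result. */
static uint32_t
sum_popcount(const int *values, size_t n)
{
    uint32_t total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        total += (uint32_t) __builtin_popcount((unsigned int) values[i]);
    }
    return total;
}
```

In the test program this would be timed as sum_popcount(random_array, N_LOOP), with the returned total printed after the second clock_gettime() call.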
Thanks, it works.
So do you think this is enough to show that __builtin_popcount is not
suitable for __core2?
Seems so, and it also makes sense as Core2 does not have the popcnt instruction.
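One way to act on this would be to select the implementation at compile time. The sketch below assumes GCC or Clang, which as far as I know predefine __POPCNT__ when the target supports the popcnt instruction (for example with -mpopcnt or a suitable -march). The names popcount32 and popcount32_fallback are hypothetical, and the fallback is the classic parallel bit count, not necessarily what count_1bits_32 in OVS does.

```c
#include <stdint.h>

/* Portable fallback: classic parallel bit count (an assumption for
 * illustration, not the OVS count_1bits_32 implementation). */
static unsigned int
popcount32_fallback(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0f0f0f0fu;                 /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes */
}

/* Use the builtin only when the target has the popcnt instruction. */
static unsigned int
popcount32(uint32_t x)
{
#ifdef __POPCNT__
    return (unsigned int) __builtin_popcount(x);
#else
    return popcount32_fallback(x);
#endif
}
```

On a Core2 build __POPCNT__ would not be defined, so popcount32() would take the fallback path rather than whatever the compiler emits for __builtin_popcount.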
Jarno
Thanks!
Best regards,
Alexander Wu
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev