https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301
Thomas Koenig <tkoenig at gcc dot gnu.org> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |segher at gcc dot gnu.org --- Comment #2 from Thomas Koenig <tkoenig at gcc dot gnu.org> --- Some benchmarks are interesting, and it seems that things are a lot less clear than I thought. a.c contains the code in the original test. bench.c has #include <stdio.h> #include <sys/time.h> #include <stdlib.h> #define N 1024 #define TIMES (1024*1024) int foo(int); int foo2(int); const int nums[] = #ifdef FULL { 5000, 11111, 15000, 22222, 25000, 33333, 35000, 44444, 50000, 55555, 60000, 66666, 70000, 77777, 80000, 88888, 90000, 99999, 100000}; #else { 11111, 22222, 33333, 44444, 55555, 66666, 77777, 88888, 99999 }; #endif double this_time () { struct timeval tv; gettimeofday (&tv, NULL); return tv.tv_sec + tv.tv_usec * 1e-6; } int main() { double t1, t2; int *x; int r; int s; x = malloc(N * sizeof(*x)); for (int i=0; i<N; i++) { r = drand48() * sizeof(nums)/sizeof(int); x[i] = nums[r]; } t1 = this_time(); for (int j=0; j<TIMES; j++) for (int i=0; i<N; i++) foo (x[i]); t2 = this_time(); printf ("foo: %f\n", t2-t1); t1 = this_time(); for (int j=0; j<TIMES; j++) for (int i=0; i<N; i++) foo2 (x[i]); t2 = this_time(); printf ("foo2: %f\n", t2-t1); return 0; } On my home box, a Ryzen 1700, I get $ gcc -O3 bench.c a.c && ./a.out foo: 4.324224 foo2: 4.218115 $ gcc -DFULL -O3 bench.c a.c && ./a.out foo: 4.132713 foo2: 3.924765 so there is an advantage for the hand-crafted binary search, but it is rather small (up to 5%). On POWER, the results are [tkoenig@gcc135 ~]$ gcc -O3 bench.c a.c && ./a.out foo: 6.489096 foo2: 8.457161 [tkoenig@gcc135 ~]$ gcc -O3 -DFULL bench.c a.c && ./a.out foo: 6.481499 foo2: 6.976012 so things are actually slower with the manual version of the switch. Things get very interesting when checking against clang 12: [tkoenig@gcc135 ~]$ clang -O3 bench.c a.c && ./a.out foo: 7.489085 foo2: 4.175748 [tkoenig@gcc135 ~]$ clang -O3 -DFULL bench.c a.c && ./a.out foo: 8.142711 foo2: 3.983875 clang's own switch is slower, but the handcrafted version is very fast. clang apparently decides to make heavy use of POWER's condition registers and isel: 00000000000000e0 <foo2>: e0: ce 56 83 2c cmpwi cr1,r3,22222 e4: 00 00 80 3c lis r4,0 e8: 00 00 a0 38 li r5,0 ec: 02 00 c0 38 li r6,2 f0: 01 00 e0 38 li r7,1 f4: 01 00 00 3d lis r8,1 f8: 67 2b 83 2a cmplwi cr5,r3,11111 fc: 9c ad 03 28 cmplwi r3,44444 100: 35 82 89 60 ori r9,r4,33333 104: 04 00 40 39 li r10,4 108: 6a 04 08 61 ori r8,r8,1130 10c: 03 d9 84 60 ori r4,r4,55555 110: 9e 29 c6 7c isel r6,r6,r5,6 114: 35 82 03 2b cmplwi cr6,r3,33333 118: 00 48 83 7c cmpw cr1,r3,r9 11c: 9e 35 c7 7c isel r6,r7,r6,22 120: 03 00 e0 38 li r7,3 124: 01 00 69 6c xoris r9,r3,1 128: 9e 28 4a 7d iseleq r10,r10,r5 12c: 40 40 03 7c cmplw r3,r8 130: 66 2b 08 39 addi r8,r8,11110 134: d1 2f 89 2a cmplwi cr5,r9,12241 138: 9e 56 e7 7c isel r7,r7,r10,26 13c: 06 00 40 39 li r10,6 140: 9f 86 09 2b cmplwi cr6,r9,34463 144: 38 5b 89 2b cmplwi cr7,r9,23352 148: 08 00 20 39 li r9,8 14c: 9e 28 4a 7d iseleq r10,r10,r5 150: 00 40 03 7c cmpw r3,r8 154: 09 00 00 39 li r8,9 158: 9e 2f a9 7c isel r5,r9,r5,30 15c: 1e 39 c6 7c isel r6,r6,r7,4 160: 05 00 e0 38 li r7,5 164: 03 d9 83 28 cmplwi cr1,r3,55555 168: 9e 2e a8 7c isel r5,r8,r5,26 16c: 07 00 00 39 li r8,7 170: 9e 51 e7 7c isel r7,r7,r10,6 174: 00 20 83 7c cmpw cr1,r3,r4 178: 9e 2d a8 7c isel r5,r8,r5,22 17c: 5e 38 65 7c iselgt r3,r5,r7 180: 1e 19 66 7c isel r3,r6,r3,4 184: 20 00 63 78 clrldi r3,r3,32 188: 20 00 80 4e blr I don't have any recent Intel hardware to test. So, things are less clear then when just looking at the assembler code, and the behavior of branch predictors is not easy to predict. A small (up to 5% advantage) x86_64 could be a big disadvantage or advantage on POWER, depending on how it is compiled...