[Bug tree-optimization/101301] Improving sparse switch statement

tkoenig at gcc dot gnu.org via Gcc-bugs Sat, 03 Jul 2021 11:42:12 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101301


Thomas Koenig <tkoenig at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |segher at gcc dot gnu.org

--- Comment #2 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Some benchmarks are interesting, and it seems that things are a lot less clear
than I thought.

a.c contains the code in the original test. bench.c has


#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>

#define N 1024
#define TIMES (1024*1024)

int foo(int);
int foo2(int);

const int nums[] =
#ifdef FULL
  {  5000, 11111, 15000, 22222, 25000, 33333, 35000, 44444, 50000,
    55555, 60000, 66666, 70000, 77777, 80000, 88888, 90000, 99999,
     100000};
#else
  {  11111, 22222, 33333, 44444, 55555, 66666, 77777, 88888, 99999 };
#endif

double this_time ()
{
  struct timeval tv;
  gettimeofday (&tv, NULL);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main()
{
  double t1, t2;
  int *x;
  int r;
  int s;
  x = malloc(N * sizeof(*x));
  for (int i=0; i<N; i++)
    {
      r = drand48() * sizeof(nums)/sizeof(int);
      x[i] = nums[r];
    }
  t1 = this_time();
  for (int j=0; j<TIMES; j++)
    for (int i=0; i<N; i++)
      foo (x[i]);
  t2 = this_time();
  printf ("foo: %f\n", t2-t1);
  t1 = this_time();
  for (int j=0; j<TIMES; j++)
    for (int i=0; i<N; i++)
      foo2 (x[i]);
  t2 = this_time();
  printf ("foo2: %f\n", t2-t1);
  return 0;
}

On my home box, a Ryzen 1700, I get
$ gcc -O3 bench.c a.c && ./a.out
foo: 4.324224
foo2: 4.218115
$ gcc -DFULL -O3 bench.c a.c && ./a.out
foo: 4.132713
foo2: 3.924765

so there is an advantage for the hand-crafted binary search, but it
is rather small (up to 5%).

On POWER, the results are

[tkoenig@gcc135 ~]$ gcc -O3 bench.c a.c && ./a.out
foo: 6.489096
foo2: 8.457161
[tkoenig@gcc135 ~]$ gcc -O3 -DFULL bench.c a.c && ./a.out
foo: 6.481499
foo2: 6.976012

so things are actually slower with the manual version of the switch.

Things get very interesting when checking against clang 12:
[tkoenig@gcc135 ~]$ clang -O3 bench.c a.c && ./a.out
foo: 7.489085
foo2: 4.175748
[tkoenig@gcc135 ~]$ clang -O3 -DFULL bench.c a.c && ./a.out
foo: 8.142711
foo2: 3.983875

clang's own switch is slower, but the handcrafted version is very fast.
clang apparently decides to make heavy use of POWER's condition
registers and isel:

00000000000000e0 <foo2>:
  e0:   ce 56 83 2c     cmpwi   cr1,r3,22222
  e4:   00 00 80 3c     lis     r4,0
  e8:   00 00 a0 38     li      r5,0
  ec:   02 00 c0 38     li      r6,2
  f0:   01 00 e0 38     li      r7,1
  f4:   01 00 00 3d     lis     r8,1
  f8:   67 2b 83 2a     cmplwi  cr5,r3,11111
  fc:   9c ad 03 28     cmplwi  r3,44444
 100:   35 82 89 60     ori     r9,r4,33333
 104:   04 00 40 39     li      r10,4
 108:   6a 04 08 61     ori     r8,r8,1130
 10c:   03 d9 84 60     ori     r4,r4,55555
 110:   9e 29 c6 7c     isel    r6,r6,r5,6
 114:   35 82 03 2b     cmplwi  cr6,r3,33333
 118:   00 48 83 7c     cmpw    cr1,r3,r9
 11c:   9e 35 c7 7c     isel    r6,r7,r6,22
 120:   03 00 e0 38     li      r7,3
 124:   01 00 69 6c     xoris   r9,r3,1
 128:   9e 28 4a 7d     iseleq  r10,r10,r5
 12c:   40 40 03 7c     cmplw   r3,r8
 130:   66 2b 08 39     addi    r8,r8,11110
 134:   d1 2f 89 2a     cmplwi  cr5,r9,12241
 138:   9e 56 e7 7c     isel    r7,r7,r10,26
 13c:   06 00 40 39     li      r10,6
 140:   9f 86 09 2b     cmplwi  cr6,r9,34463
 144:   38 5b 89 2b     cmplwi  cr7,r9,23352
 148:   08 00 20 39     li      r9,8
 14c:   9e 28 4a 7d     iseleq  r10,r10,r5
 150:   00 40 03 7c     cmpw    r3,r8
 154:   09 00 00 39     li      r8,9
 158:   9e 2f a9 7c     isel    r5,r9,r5,30
 15c:   1e 39 c6 7c     isel    r6,r6,r7,4
 160:   05 00 e0 38     li      r7,5
 164:   03 d9 83 28     cmplwi  cr1,r3,55555
 168:   9e 2e a8 7c     isel    r5,r8,r5,26
 16c:   07 00 00 39     li      r8,7
 170:   9e 51 e7 7c     isel    r7,r7,r10,6
 174:   00 20 83 7c     cmpw    cr1,r3,r4
 178:   9e 2d a8 7c     isel    r5,r8,r5,22
 17c:   5e 38 65 7c     iselgt  r3,r5,r7
 180:   1e 19 66 7c     isel    r3,r6,r3,4
 184:   20 00 63 78     clrldi  r3,r3,32
 188:   20 00 80 4e     blr

I don't have any recent Intel hardware to test.

So, things are less clear then when just looking at the assembler code,
and the behavior of branch predictors is not easy to predict.

A small (up to 5% advantage) x86_64 could be a big disadvantage or
advantage on POWER, depending on how it is compiled...

[Bug tree-optimization/101301] Improving sparse switch statement

Reply via email to