The following uses a file "ceval.i" that I'll upload shortly and a nightly build of gcc-4.4.0-20090223. The file is a version of Python's core interpretation loop modified to use computed gotos in order to help the processor's branch predictor do a better job than with a switch statement. However, gcc combines the computed gotos into only a few instructions, producing a result nearly as bad as the switch.
The computed gotos are all of the form "goto *next_label;". About half of the instances counted are reachable, as measured by the asm volatile farther down. $ grep -c -F 'goto *next_label' ceval.i 256 I compile with -fno-gcse because http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/Optimize-Options.html#index-fgcse-646 says it helps and with "--param max-goto-duplication-insns=100000" because Diego Novillo suggested tweaking that parameter. $ gcc-4.4 -m32 -pthread -fno-strict-aliasing -g -fwrapv -O3 -Wall -Wstrict-prototypes -fno-gcse --param max-goto-duplication-insns=100000 -S -dA ceval.i -o ceval.s $ egrep -c 'jmp[[:space:]]*\*' ceval.s 4 I tried to make the common instruction sequence leading up to the indirection jump shorter by putting an asm volatile before it, but that didn't work. You can see the result by running: $ sed 's!goto \*next_label!asm volatile ("/* Computed goto */":::"memory");goto *next_label!' < ceval.i > ceval-asm.i $ gcc-4.4 -m32 -pthread -fno-strict-aliasing -g -fwrapv -O3 -fno-gcse --param max-goto-duplication-insns=100000 -S -dA ceval-asm.i -o ceval-asm.s $ egrep -c 'Computed goto' ceval-asm.s 130 $ egrep -c 'jmp[[:space:]]*\*' ceval-asm.s 4 One of the relevant bits of assembly (search for "Computed goto" in ceval-asm.s) is: .L1125: # basic block 66 .LBE835: # ../src/Python/ceval.c:1000 .loc 1 1000 0 jmp *-72(%ebp) .LVL558: .p2align 4,,7 .p2align 3 .L1433: # basic block 67 ... # basic block 114 #APP # 11 "../src/Include/ceval-vm.i" 1 /* Computed goto */ # 0 "" 2 #NO_APP movl %esi, %ebx .LVL599: .p2align 4,,7 .p2align 3 .L484: # basic block 115 # ../src/Python/ceval.c:1000 .loc 1 1000 0 movl $1, %eax .LVL600: movl %ebx, %esi .LVL601: jmp .L1125 The documentation for max-goto-duplication-insns says: The maximum number of instructions to duplicate to a block that jumps to a computed goto. To avoid O(N^2) behavior in a number of passes, GCC factors computed gotos early in the compilation process, and unfactors them as late as possible. Only computed jumps at the end of a basic blocks with no more than max-goto-duplication-insns are unfactored. The default value is 8. Since .L1125 and basic block 66 have only a single instruction, the computed goto should have been unfactored. $ gcc-4.4 -v Using built-in specs. Target: i686-pc-linux-gnu Configured with: /home/dnovillo/perf/sbox/gcc/local.x86_64/src/configure --prefix=/home/dnovillo/perf/sbox/gcc/local.x86_64/inst --srcdir=/home/dnovillo/perf/sbox/gcc/local.x86_64/src --with-gmp=/home/dnovillo/i686 --with-mpfr=/home/dnovillo/i686 --build=i686-pc-linux-gnu --enable-languages=c++,fortran,objc,obj-c++ Thread model: posix gcc version 4.4.0 20090223 (experimental) (GCC) -- Summary: Computed gotos combined too aggressively Product: gcc Version: 4.4.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: jyasskin at gmail dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39284