Paolo 'Blaisorblade' Giarrusso <p.giarru...@gmail.com> added the comment:
About miscompilations: the current patch is a bit weird for GCC, because you keep both the switch and the computed goto. But there is actually no case in which the switch is needed, and computed goto leaves GCC less room for its own choices. So, can you try dropping the switch altogether, always using computed goto, and see how the resulting code gets compiled? I see you'll need two labels (before and after argument fetch) per opcode and two dispatch tables, but that's no big deal (except for code alignment - align just the common branch target).

An important warning: by default, on my system, GCC 4.2 aligns branch targets for switch to a 16-byte boundary (as recommended by the Intel optimization guide) by adding a ".p2align 4,,7" GAS directive, and it does not do that for computed goto. Adding the directive by hand gave a small speedup, 2% I think. I should try -falign-jumps=16 if it's not already enabled (some -falign-jumps is enabled by -O2), since that is supposed to give the same result. Please use that yourself as well, and verify that it works for labels, even if I fear it doesn't.

> However, I don't know why the speed up due to the patch is much more significant on x86-64 than on x86.

It's Amdahl's law, even if this is not about parallel code. When the rest is faster (x86_64), the same speedup on dispatch gives a bigger overall speedup. To be absolutely clear: x86_64 has more registers, so the rest of the interpreter is faster than on x86, but dispatch still takes the same absolute time, which is 70% of the total on x86_64 but only 50% on x86 (those are realistic figures). If this patch halved dispatch time on both (we're not so lucky), we would save 35% on x86_64 but only 25% on x86. In fact, on inefficient interpreters, indirect threading is useless altogether.

So, do those extra registers help _so_ much? Yes.
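For reference, here is a minimal sketch of what switch-free dispatch looks like. It assumes GCC's labels-as-values extension (also accepted by Clang). The opcode set, the run() function and the single dispatch table are made up for illustration and are not taken from the patch or from ceval.c, which would need two labels per opcode and two tables as described above.

```c
/* Toy stack interpreter dispatching only through computed goto.
   Hypothetical opcodes; requires GCC's labels-as-values extension. */

enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

long run(const unsigned char *code)
{
    /* One label address per opcode, indexed by opcode number. */
    static void *const dispatch[] = {
        &&op_push, &&op_add, &&op_mul, &&op_halt
    };
    long stack[16];
    long *sp = stack;                /* points one past the top of stack */
    const unsigned char *pc = code;  /* a local, so it can stay in a register */

/* Fetch the next opcode and jump straight to its handler: no switch. */
#define DISPATCH() goto *dispatch[*pc++]

    DISPATCH();

op_push:
    *sp++ = *pc++;                   /* the operand byte follows the opcode */
    DISPATCH();
op_add:
    sp--;
    sp[-1] += sp[0];
    DISPATCH();
op_mul:
    sp--;
    sp[-1] *= sp[0];
    DISPATCH();
op_halt:
    return sp[-1];

#undef DISPATCH
}
```

Feeding run() the byte sequence OP_PUSH 2, OP_PUSH 3, OP_ADD, OP_PUSH 4, OP_MUL, OP_HALT evaluates (2 + 3) * 4. Each handler ends with its own indirect jump, which is the property the alignment remarks above are about.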
In my toy interpreter, computing last_i for each dispatch doesn't give any big slowdown, but storing it in f->last_i gives a ~20% slowdown - I cross-checked multiple times because I was astonished. And when the program counter itself had to be kept in memory, I think it was about 2x slower.

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue4753>
_______________________________________