Steve Fink wrote: > I just did all that because the magnitude of the difference between > using separate functions and using the closer-to-assembly solutions > (switch and goto) seemed too large. I can easily believe they're faster, > but not by that much, and I didn't want to unnecessarily lose the > flexibility, readability, and compilability of keeping everything > cleanly separated out into separate functions. My original purpose was just to play with dispatchers -- we really won't know what's the best solution until more of Perl 6 starts to evolve. The dispatcher is probably a very tiny piece of the performance picture anyways, so I don't think it matters (much) that computed goto is 10x faster than function call. The only VMs sensitive to dispatch seem to be those with teeny weeny instructions, like Java or actual hardware devices. Since perl has *huge* instructions it doesn't spend much time in the dispatcher. I think the best architecture would be one with two dispatch systems. A really fast/simple one (possibly obscure) that's used for small instructions like array indexing, counter increments, sub calls, etc. and another flexible/general one (possibly obscure ;) that's used for the "big" instructions like sort and eval. Maybe we could even force all those big instructions into the vtbl. The different function call dispatchers that you posted pretty much round out the simple dispatch techniques. The only other one I'd like to see is a binary decision tree rather than a sequential scan. (The tests could be chosen to maximize branch predictions too.) Seems hard to believe that there's a machine out there that could execute several test and branch instructions in the time it takes to do a table lookup though. Anways, now for something *completely* different... I built a simple TIL and also expanded the opcode sets for the switch and goto based dispatchers. Here's the timing (same conditions as before) for -O only: function call (original) 12.6 secs switch w/ expanded opcodes 6.2 secs naive TIL 5.9 secs computed goto w/ expanded opcodes 1.4 secs That's not a typo for "native" -- my TIL is incredibly naive. I had a tough time getting gcc to work with my dispatcher. It does a really good job at "dead" code elimination and other optimizations that make it really hard to interface with. Basically I'm stuck doing normal function calls because anything more either breaks with optimization or severely constrains coding style. At least the execute() function has become pretty simple now. ;) All it does is verify the code has been compiled and then: asm("call *%0" : : "r" (vm_executable)); The compiler (threader?) generates machine code in vm_executable and opcode arguments in vm_data. Here's what vm_executable looks like: (gdb) x/11i vm_executable 0x805e1c0 <vm_executable>: mov $0x8054580,%eax 0x805e1c5 <vm_executable+5>: call 0x80485d4 <op_push> 0x805e1ca <vm_executable+10>: mov $0x8054584,%eax 0x805e1cf <vm_executable+15>: call 0x80485d4 <op_push> 0x805e1d4 <vm_executable+20>: call 0x80485e8 <op_add> 0x805e1d9 <vm_executable+25>: mov $0x8054588,%eax 0x805e1de <vm_executable+30>: call 0x80485d4 <op_push> 0x805e1e3 <vm_executable+35>: call 0x80485e8 <op_add> 0x805e1e8 <vm_executable+40>: mov $0x805458c,%eax 0x805e1ed <vm_executable+45>: call 0x80485d4 <op_push> 0x805e1f2 <vm_executable+50>: call 0x80485e8 <op_add> The register assignments you see there are for the opcodes that take arguments. When vm_executable is built, an argument array is also built so that vm_executable can stay pure code. Here's an example of an opcode that takes arguments: void op_push() { register int *args asm("eax"); *vm_sp++ = *args; } The TIL uses the machine stack, but it should be very easy to swap stacks since we have to get cozy with the hardware anwyays. Light-weight threads of "pure" Perl would be super easy. At this point I'm just surprised it works at all. I have no idea if the instructions I'm cranking out are efficient. (Should they be aligned? Is there a faster mov?) I think it should be pretty simple to emit test and branch instructions directly, so in practice I'd expect this TIL to perform better than it does now. It seems like a good idea to try and coordinate efforts with another group doing native code generation (like blackdown.org). If we did inlining or peephole optimization I'm sure we'd be re-inventing stuff they've already started on. Anybody have ideas for making this *fast*? - Ken P.S. Sorry for that inane add.dat file I posted. I should have posted the 10 line script that generated it. Anyways, nobody chewed me out for posting MIME attachments so I guess I'll keep doing it...
/* This code is in the public domain. Ken Fox, Nov 2000 */ #define ITERS 100000 #define USE_TIL #include <stdlib.h> #include <stdio.h> typedef union { int i; void *p; unsigned char b[4]; } cell; void load(char *filename, cell *code) { FILE *input; cell *pc = code; if (input = fopen(filename, "r")) { int i; while (fscanf(input, " %d ", &i) != EOF) { pc->i = i; ++pc; } fclose(input); } pc->i = 0; } void dump(cell *code) { cell *pc = code; printf("byte code:\n"); while (pc->i) { printf(" %5d\n", pc->i); ++pc; } } enum { STOP = 0, PUSH = 1, ADD = 2, PRINT = 3, POP = 4, SUB = 5, NEG = 6, MUL = 7, DIV = 8 }; int count_ops(cell *code) { int icount = 0; cell *pc = code; op: switch (pc->i) { case STOP: return icount; case PUSH: ++icount; pc += 2; goto op; case ADD: ++icount; pc += 1; goto op; case PRINT: ++icount; pc += 1; goto op; } } void fixup(cell *code, void *ops[]) { cell *pc = code; op: switch (pc->i) { case STOP: pc->p = ops[pc->i]; return; case PUSH: pc->p = ops[pc->i]; pc += 2; goto op; case ADD: pc->p = ops[pc->i]; pc += 1; goto op; case PRINT: pc->p = ops[pc->i]; pc += 1; goto op; } } #ifdef USE_SWITCH void execute(cell *code) { static int stack[1000]; cell *pc = code; int *sp = stack; int x, y; op: switch (pc++->i) { case STOP: return; case PUSH: *sp++ = pc++->i; goto op; case POP: --sp; goto op; case ADD: y = *--sp; x = *--sp; *sp++ = x + y; goto op; case SUB: y = *--sp; x = *--sp; *sp++ = x - y; goto op; case NEG: x = *--sp; *sp++ = -x; goto op; case MUL: y = *--sp; x = *--sp; *sp++ = x * y; goto op; case DIV: y = *--sp; x = *--sp; *sp++ = x / y; goto op; case PRINT: printf("--> %d\n", *--sp); goto op; } } #else #ifdef USE_FUNCALL typedef void *(*op)(); cell *vm_pc; int *vm_sp; void *op_stop() { return 0; } void *op_push() { *vm_sp++ = vm_pc[1].i; return vm_pc + 2; } void *op_add() { int y = *--vm_sp; int x = *--vm_sp; *vm_sp++ = x + y; return vm_pc + 1; } void *op_print() { printf("--> %d\n", *--vm_sp); return vm_pc + 1; } void execute(cell *code) { static void *ops[] = { op_stop, op_push, op_add, op_print }; static int stack[1000]; if (code->i > 0 && code->i < 255) { fixup(code, ops); } vm_pc = code; vm_sp = stack; while (vm_pc = ((op)(vm_pc->p))()) ; } #else #ifdef USE_GOTO void execute(cell *code) { static void *ops[] = { &&op_stop, &&op_push, &&op_add, &&op_print, &&op_pop, &&op_sub, &&op_neg, &&op_mul, &&op_div }; static int stack[1000]; cell *pc = code; int *sp = stack; int x, y; if (code->i > 0 && code->i < 255) { fixup(code, ops); } goto *((pc++)->p); op_stop: return; op_push: *sp++ = pc++->i; goto *((pc++)->p); op_pop: --sp; goto *((pc++)->p); op_add: y = *--sp; x = *--sp; *sp++ = x + y; goto *((pc++)->p); op_sub: y = *--sp; x = *--sp; *sp++ = x - y; goto *((pc++)->p); op_neg: x = *--sp; *sp++ = -x; goto *((pc++)->p); op_mul: y = *--sp; x = *--sp; *sp++ = x * y; goto *((pc++)->p); op_div: y = *--sp; x = *--sp; *sp++ = x / y; goto *((pc++)->p); op_print: printf("--> %d\n", *--sp); goto *((pc++)->p); } #else #ifdef USE_TIL unsigned char vm_executable[40000]; int vm_data[10000]; int *vm_sp; int op_stop() { int x; asm("pop %0" : "=g" (x)); return x; /* return to execute() */ } void op_push() { register int *args asm("eax"); *vm_sp++ = *args; } void op_add() { int x = *--vm_sp; int y = *--vm_sp; *vm_sp++ = x + y; } void op_print() { printf("--> %d\n", *--vm_sp); } void compile(cell *code) { static void *ops[] = { op_stop, op_push, op_add, op_print }; /* put a sequence of register eax = vm_data + offset; asm("call %0" : : "g" opcode); into vm_executable. The register assignment is only necessary if the opcode takes arguments. * 80x86 near direct call is: 11101000 disp32 0xe8 disp32 = destination_addr - addr_of_next_instruction * 80x86 code for 32-bit constant assignment to the eax register is: 10111000 constant_value 0xb8 All of the above 32-bit integers are in little endian format. Put the numbers into the target executable using: cell v; v.i = ...; memcpy(dest, v.b, 4); dest += 4; */ #define intel_call 0xe8 #define intel_mov 0xb8 unsigned char *dest = vm_executable; int *args = vm_data; cell *pc = code; cell v; #define emit_op_call() \ *dest++ = intel_call; \ v.i = (unsigned char *)ops[pc->i] - dest - 4; \ memcpy(dest, v.b, 4); \ dest += 4; \ ++pc op: switch (pc->i) { case STOP: emit_op_call(); /* hack to get execute() to only compile once */ code->p = ops[code->i]; return; case PUSH: *dest++ = intel_mov; v.p = args; memcpy(dest, v.b, 4); dest += 4; emit_op_call(); *args++ = pc->i; ++pc; goto op; case ADD: emit_op_call(); goto op; case PRINT: emit_op_call(); goto op; } } void execute(cell *code) { static int stack[1000]; if (code->i != 0) { compile(code); } vm_sp = stack; asm("call *%0" : : "r" (vm_executable)); } #endif #endif #endif #endif int main(int argc, char *argv[]) { static cell code[10000]; int n; if (argc != 2) { printf("vm-test [byte-code-file]\n"); return 1; } load(argv[1], code); printf("executing %d ops.\n", ITERS * count_ops(code)); for (n = 0; n < ITERS; ++n) { execute(code); } return 0; }