Steve Fink wrote:
> I just did all that because the magnitude of the difference between
> using separate functions and using the closer-to-assembly solutions
> (switch and goto) seemed too large. I can easily believe they're faster,
> but not by that much, and I didn't want to unnecessarily lose the
> flexibility, readability, and compilability of keeping everything
> cleanly separated out into separate functions.

My original purpose was just to play with dispatchers -- we really
won't know what's the best solution until more of Perl 6 starts to
evolve. The dispatcher is probably a very tiny piece of the performance
picture anyways, so I don't think it matters (much) that computed goto
is 10x faster than function call.

The only VMs sensitive to dispatch seem to be those with teeny weeny
instructions, like Java or actual hardware devices. Since perl has
*huge* instructions it doesn't spend much time in the dispatcher.

I think the best architecture would be one with two dispatch systems.
A really fast/simple one (possibly obscure) that's used for small
instructions like array indexing, counter increments, sub calls, etc.
and another flexible/general one (possibly obscure ;) that's used
for the "big" instructions like sort and eval. Maybe we could even
force all those big instructions into the vtbl.

The different function call dispatchers that you posted pretty much
round out the simple dispatch techniques. The only other one I'd like
to see is a binary decision tree rather than a sequential scan. (The
tests could be chosen to maximize branch predictions too.) Seems hard
to believe that there's a machine out there that could execute
several test and branch instructions in the time it takes to do a
table lookup though.

Anways, now for something *completely* different...

I built a simple TIL and also expanded the opcode sets for the switch
and goto based dispatchers. Here's the timing (same conditions as before)
for -O only:

function call (original)              12.6 secs
switch w/ expanded opcodes             6.2 secs
naive TIL                              5.9 secs
computed goto w/ expanded opcodes      1.4 secs

That's not a typo for "native" -- my TIL is incredibly naive. I had
a tough time getting gcc to work with my dispatcher. It does a really
good job at "dead" code elimination and other optimizations that make
it really hard to interface with. Basically I'm stuck doing normal
function calls because anything more either breaks with optimization
or severely constrains coding style.

At least the execute() function has become pretty simple now. ;) All
it does is verify the code has been compiled and then:

  asm("call *%0" : : "r" (vm_executable));

The compiler (threader?) generates machine code in vm_executable and
opcode arguments in vm_data. Here's what vm_executable looks like:

(gdb) x/11i vm_executable
0x805e1c0 <vm_executable>:      mov    $0x8054580,%eax
0x805e1c5 <vm_executable+5>:    call   0x80485d4 <op_push>
0x805e1ca <vm_executable+10>:   mov    $0x8054584,%eax
0x805e1cf <vm_executable+15>:   call   0x80485d4 <op_push>
0x805e1d4 <vm_executable+20>:   call   0x80485e8 <op_add>
0x805e1d9 <vm_executable+25>:   mov    $0x8054588,%eax
0x805e1de <vm_executable+30>:   call   0x80485d4 <op_push>
0x805e1e3 <vm_executable+35>:   call   0x80485e8 <op_add>
0x805e1e8 <vm_executable+40>:   mov    $0x805458c,%eax
0x805e1ed <vm_executable+45>:   call   0x80485d4 <op_push>
0x805e1f2 <vm_executable+50>:   call   0x80485e8 <op_add>

The register assignments you see there are for the opcodes that
take arguments. When vm_executable is built, an argument array is
also built so that vm_executable can stay pure code. Here's an
example of an opcode that takes arguments:

void op_push()
{
    register int *args asm("eax");
    *vm_sp++ = *args;
}

The TIL uses the machine stack, but it should be very easy to
swap stacks since we have to get cozy with the hardware anwyays.
Light-weight threads of "pure" Perl would be super easy.

At this point I'm just surprised it works at all. I have no
idea if the instructions I'm cranking out are efficient. (Should
they be aligned? Is there a faster mov?) I think it should be
pretty simple to emit test and branch instructions directly, so
in practice I'd expect this TIL to perform better than it does
now.

It seems like a good idea to try and coordinate efforts with
another group doing native code generation (like blackdown.org).
If we did inlining or peephole optimization I'm sure we'd be
re-inventing stuff they've already started on.

Anybody have ideas for making this *fast*?

- Ken

P.S. Sorry for that inane add.dat file I posted. I should have
posted the 10 line script that generated it. Anyways, nobody
chewed me out for posting MIME attachments so I guess I'll keep
doing it...

/* This code is in the public domain. Ken Fox, Nov 2000 */

#define ITERS 100000
#define USE_TIL

#include <stdlib.h>
#include <stdio.h>

typedef union {
    int i;
    void *p;
    unsigned char b[4];
} cell;

void load(char *filename, cell *code)
{
    FILE *input;
    cell *pc = code;

    if (input = fopen(filename, "r")) {
        int i;
        while (fscanf(input, " %d ", &i) != EOF) {
            pc->i = i;
            ++pc;
        }
        fclose(input);
    }

    pc->i = 0;
}

void dump(cell *code)
{
    cell *pc = code;

    printf("byte code:\n");
    while (pc->i) {
        printf("  %5d\n", pc->i);
        ++pc;
    }
}

enum {
    STOP = 0,
    PUSH = 1,
    ADD = 2,
    PRINT = 3,
    POP = 4,
    SUB = 5,
    NEG = 6,
    MUL = 7,
    DIV = 8
};

int count_ops(cell *code)
{
    int icount = 0;

    cell *pc = code;

op: switch (pc->i) {
        case STOP:
            return icount;

        case PUSH:
            ++icount;
            pc += 2;
            goto op;

        case ADD:
            ++icount;
            pc += 1;
            goto op;

        case PRINT:
            ++icount;
            pc += 1;
            goto op;
    }
}

void fixup(cell *code, void *ops[])
{
    cell *pc = code;

op: switch (pc->i) {
        case STOP:
            pc->p = ops[pc->i];
            return;

        case PUSH:
            pc->p = ops[pc->i];
            pc += 2;
            goto op;

        case ADD:
            pc->p = ops[pc->i];
            pc += 1;
            goto op;

        case PRINT:
            pc->p = ops[pc->i];
            pc += 1;
            goto op;
    }
}

#ifdef USE_SWITCH

void execute(cell *code)
{
    static int stack[1000];
    cell *pc = code;
    int *sp = stack;
    int x, y;

op: switch (pc++->i) {
        case STOP:
            return;

        case PUSH:
            *sp++ = pc++->i;
            goto op;

        case POP:
            --sp;
            goto op;

        case ADD:
            y = *--sp;
            x = *--sp;
            *sp++ = x + y;
            goto op;

        case SUB:
            y = *--sp;
            x = *--sp;
            *sp++ = x - y;
            goto op;

        case NEG:
            x = *--sp;
            *sp++ = -x;
            goto op;

        case MUL:
            y = *--sp;
            x = *--sp;
            *sp++ = x * y;
            goto op;

        case DIV:
            y = *--sp;
            x = *--sp;
            *sp++ = x / y;
            goto op;

        case PRINT:
            printf("--> %d\n", *--sp);
            goto op;
    }
}

#else
#ifdef USE_FUNCALL

typedef void *(*op)();

cell *vm_pc;
int *vm_sp;

void *op_stop()
{
    return 0;
}

void *op_push()
{
    *vm_sp++ = vm_pc[1].i;
    return vm_pc + 2;
}

void *op_add()
{
    int y = *--vm_sp;
    int x = *--vm_sp;
    *vm_sp++ = x + y;
    return vm_pc + 1;
}

void *op_print()
{
    printf("--> %d\n", *--vm_sp);
    return vm_pc + 1;
}

void execute(cell *code)
{
    static void *ops[] = { op_stop, op_push, op_add, op_print };
    static int stack[1000];

    if (code->i > 0 && code->i < 255) {
        fixup(code, ops);
    }

    vm_pc = code;
    vm_sp = stack;

    while (vm_pc = ((op)(vm_pc->p))())
        ;
}

#else
#ifdef USE_GOTO

void execute(cell *code)
{
    static void *ops[] = { &&op_stop, &&op_push, &&op_add, &&op_print,
                           &&op_pop, &&op_sub, &&op_neg, &&op_mul, &&op_div };
    static int stack[1000];
    cell *pc = code;
    int *sp = stack;
    int x, y;

    if (code->i > 0 && code->i < 255) {
        fixup(code, ops);
    }

    goto *((pc++)->p);

op_stop:
    return;

op_push:
    *sp++ = pc++->i;
    goto *((pc++)->p);

op_pop:
    --sp;
    goto *((pc++)->p);

op_add:
    y = *--sp;
    x = *--sp;
    *sp++ = x + y;
    goto *((pc++)->p);

op_sub:
    y = *--sp;
    x = *--sp;
    *sp++ = x - y;
    goto *((pc++)->p);

op_neg:
    x = *--sp;
    *sp++ = -x;
    goto *((pc++)->p);

op_mul:
    y = *--sp;
    x = *--sp;
    *sp++ = x * y;
    goto *((pc++)->p);

op_div:
    y = *--sp;
    x = *--sp;
    *sp++ = x / y;
    goto *((pc++)->p);

op_print:
    printf("--> %d\n", *--sp);
    goto *((pc++)->p);
}

#else
#ifdef USE_TIL

unsigned char vm_executable[40000];
int vm_data[10000];
int *vm_sp;

int op_stop()
{
    int x;
    asm("pop %0" : "=g" (x));
    return x; /* return to execute() */
}

void op_push()
{
    register int *args asm("eax");
    *vm_sp++ = *args;
}

void op_add()
{
    int x = *--vm_sp;
    int y = *--vm_sp;
    *vm_sp++ = x + y;
}

void op_print()
{
    printf("--> %d\n", *--vm_sp);
}

void compile(cell *code)
{
    static void *ops[] = { op_stop, op_push, op_add, op_print };

    /* put a sequence of

         register eax = vm_data + offset;
         asm("call %0" : : "g" opcode);

       into vm_executable. The register assignment is only
       necessary if the opcode takes arguments.

       * 80x86 near direct call is:

           11101000 disp32
           0xe8

         disp32 = destination_addr - addr_of_next_instruction

       * 80x86 code for 32-bit constant assignment to the eax
         register is:

           10111000 constant_value
           0xb8

         All of the above 32-bit integers are in little endian
         format. Put the numbers into the target executable using:

           cell v;
           v.i = ...;
           memcpy(dest, v.b, 4);
           dest += 4;
    */

#define intel_call  0xe8
#define intel_mov   0xb8

    unsigned char *dest = vm_executable;
    int *args = vm_data;
    cell *pc = code;
    cell v;

#define emit_op_call() \
            *dest++ = intel_call; \
            v.i = (unsigned char *)ops[pc->i] - dest - 4; \
            memcpy(dest, v.b, 4); \
            dest += 4; \
            ++pc

op: switch (pc->i) {
        case STOP:
            emit_op_call();
            /* hack to get execute() to only compile once */
            code->p = ops[code->i];
            return;

        case PUSH:
            *dest++ = intel_mov;
            v.p = args;
            memcpy(dest, v.b, 4);
            dest += 4;
            emit_op_call();
            *args++ = pc->i;
            ++pc;
            goto op;

        case ADD:
            emit_op_call();
            goto op;

        case PRINT:
            emit_op_call();
            goto op;
    }
}

void execute(cell *code)
{
    static int stack[1000];

    if (code->i != 0) {
        compile(code);
    }

    vm_sp = stack;

    asm("call *%0" : : "r" (vm_executable));
}

#endif
#endif
#endif
#endif

int main(int argc, char *argv[])
{
    static cell code[10000];
    int n;

    if (argc != 2) {
        printf("vm-test [byte-code-file]\n");
        return 1;
    }

    load(argv[1], code);

    printf("executing %d ops.\n", ITERS * count_ops(code));
    for (n = 0; n < ITERS; ++n) {
        execute(code);
    }

    return 0;
}

Reply via email to