Dan Sugalski <[EMAIL PROTECTED]> wrote:

> I've been thinking about vtable and opcode functions written in
> bytecode, and I think that we need an alternate form of sub calling.
> (And yes, this *is* everyone's chance to say "I told you so")
I don't think that the calling conventions are actually a problem with overridden vtable methods or such. Just the opposite: they are fine. A HLL compiler and imcc both know one way to emit the code for a sub, be it a normal sub, a method call, or an overridden vtable or opcode method.

But - and that's likely the reason for your mail - it's a bit slow (not horribly any more, but still slow). So let's investigate the individual steps of an overridden vtable method call, i.e. the delegate code:

1) Register preserving

We can't do much about that - except we switch to a scheme where such method functions have to preserve their registers themselves, but that violates symmetry and is a penalty if such a function is called directly. Register preserving is already optimized: it reuses allocated register frame memory via a free list and doesn't take much time.

2) Method lookup

That's currently two hash lookups: one for the namespace, one for the method. I've sped that up by using hash functions directly instead of the PerlHash interface. Using a method cache (or hoisting the namespace PMC lookup out of the loop) reduces that to one hash lookup.

3) Setting up registers according to PCC

This boils down to almost nothing with the JIT core (~6 machine instructions). The fib benchmark shows that nicely.

4) Setting up method arguments

That's currently done via the signature string. It loops over the signature and moves the va_list-type arguments passed in into registers. It shouldn't take much time - we typically have only 3 arguments. It could be hard-coded again, like in your first version; OTOH that is code bloat.

5) Creating a return continuation

This could be optimized away *if* we know that it's always a method sub run in its own interpreter loop. An <end> opcode would do it. OTOH we might need the continuation to restore some context items. We could keep some return continuations around (in a free list) and only update their context: see the C<updatecc> opcode.
6) Reentering the run loop

This currently needs 5 nested function calls:

- runops pushes a new Parrot_exception
- runops_ex is a currently needed ugly hack to allow intersegment branches (i.e. evaled code has a "goto main" inside)
- runops_int handles resumable opcodes like C<trace>
- runops_xxx does run-loop-specific setup, like JITting the code if it isn't yet JITted
- finally, the run loop itself

We could call into some inner runops if a method call doesn't need all this setup. We could also call a specialized runops wrapper that shortcuts this setup. That doesn't achieve much, though - see below.

7) Leaving all these run loops

8) Return value handling, if any

9) Register frame restore

Done. When the above sequence is run for a new object, we additionally have object construction:

10) _instantiate_object

It was already discussed how to speed that up with a different object layout and by not using any aggregate PMC containers.

Finally, some current timing results (parrot -O3, Athlon 800, JIT core):

  Create 100,000 new PerlInts + 100,000 invokecc __init   0.24 s
  Create 100,000 new delegate PMCs and call __init        0.60 s
  same, calling runops_int directly                       0.57 s
  Create 100,000 new objects and call __init              1.00 s

Object instantiation is 40% of the whole time used. Let's start by optimizing the object layout first.

leo