Dan Sugalski <[EMAIL PROTECTED]> wrote:
> I've been thinking about vtable and opcode functions written in
> bytecode, and I think that we need an alternate form of sub calling.
> (And yes, this *is* everyone's chance to say "I told you so")

I don't think that calling conventions are actually a problem with
overridden vtable methods or the like. Just the opposite - they are fine.
A HLL compiler and imcc both know one way to spit out the code for a sub,
be it a normal one, a method call, or an overridden vtable or opcode
method.

But - and that's likely the reason for your mail - it's a bit slow (not
horribly so any more, but still slow).

So let's investigate the individual steps of an overridden vtable
method call, i.e. the delegate code:

1) Register preserving
  We can't do much about that - unless we switch to a scheme where such
  method functions have to preserve their registers themselves - but that
  violates symmetry and is a penalty if such a function is called directly.
  Register preserving is already optimized - it reuses allocated register
  frame memory via a free list and doesn't take much time.
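
  A minimal sketch of that free-list reuse (Reg_Frame, frame_alloc etc. are
  made-up names for illustration, not the actual structures):

    /* Hypothetical sketch only: reuse register frame memory through a
     * free list instead of a malloc/free pair per call.  Names invented. */
    #include <stdlib.h>

    typedef struct Reg_Frame {
        struct Reg_Frame *next;   /* link while the frame is on the free list */
        long    int_regs[32];
        double  num_regs[32];
        void   *str_regs[32];
        void   *pmc_regs[32];
    } Reg_Frame;

    static Reg_Frame *frame_free_list = NULL;

    /* get a frame: pop one from the free list if possible, malloc otherwise */
    static Reg_Frame *frame_alloc(void)
    {
        Reg_Frame *f = frame_free_list;
        if (f)
            frame_free_list = f->next;
        else
            f = malloc(sizeof *f);
        return f;
    }

    /* hand the frame back to the free list instead of freeing it */
    static void frame_release(Reg_Frame *f)
    {
        f->next = frame_free_list;
        frame_free_list = f;
    }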

2) Method lookup
  That's currently two hash lookups: one for the namespace, one for the
  method. I've sped that up by using the hash functions directly instead
  of the PerlHash interface. Using a method cache (or hoisting the
  namespace PMC out of the loop) reduces that to one hash lookup.
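
  For illustration, a simple method cache keyed on (namespace, method name)
  could look roughly like this - the structure and the fallback helper are
  invented names, not the real hash interface:

    /* Hypothetical sketch only: a small direct-mapped method cache.
     * A hit avoids both hash lookups; a miss falls back to them. */
    #include <stddef.h>

    #define MC_SLOTS 256

    typedef struct {
        const void *ns;     /* namespace key                       */
        const char *meth;   /* method name key (assumed interned)  */
        void       *sub;    /* cached sub                          */
    } MC_Entry;

    static MC_Entry method_cache[MC_SLOTS];

    /* stand-in for the two real hash lookups (namespace, then method) */
    static void *lookup_via_hashes(const void *ns, const char *meth)
    {
        (void)ns; (void)meth;
        return NULL;
    }

    static void *find_method_cached(const void *ns, const char *meth)
    {
        size_t    slot = ((size_t)ns ^ (size_t)meth) % MC_SLOTS;
        MC_Entry *e    = &method_cache[slot];

        if (e->ns == ns && e->meth == meth)   /* cache hit: 0 hash lookups */
            return e->sub;

        e->ns   = ns;                         /* cache miss: refill the slot */
        e->meth = meth;
        e->sub  = lookup_via_hashes(ns, meth);
        return e->sub;
    }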

3) Setting up registers according to PCC
  This boils down to nothing with the JIT core (~6 machine instructions).
  The fib benchmark shows that nicely.

4) Setting up method arguments
  That's currently done with the signature string. It loops over the
  signature and moves the va_list-type arguments passed in into the
  registers. Shouldn't take much time - we typically have only 3
  arguments. It could be hard-coded again like in your first version;
  OTOH that is code bloat.
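
  Roughly, the signature-driven version is the equivalent of this sketch
  (register layout and names are only illustrative; I'm assuming the
  arguments start at register 5 as per PCC):

    /* Hypothetical sketch only: fill argument registers from a signature
     * string and a va_list, e.g. sig = "IPS" for (INTVAL, PMC*, STRING*). */
    #include <stdarg.h>

    typedef struct {
        long    int_regs[32];
        double  num_regs[32];
        void   *str_regs[32];
        void   *pmc_regs[32];
    } Regs;

    static void set_args_from_sig(Regs *r, const char *sig, ...)
    {
        va_list ap;
        int i = 5, n = 5, s = 5, p = 5;   /* assumed arg registers: 5.. */

        va_start(ap, sig);
        for (; *sig; sig++) {
            switch (*sig) {
                case 'I': r->int_regs[i++] = va_arg(ap, long);   break;
                case 'N': r->num_regs[n++] = va_arg(ap, double); break;
                case 'S': r->str_regs[s++] = va_arg(ap, void *); break;
                case 'P': r->pmc_regs[p++] = va_arg(ap, void *); break;
            }
        }
        va_end(ap);
    }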

5) Creating a return continuation
  Could be optimized away, *if* we know that it's always a method sub and
  is run in its own interpreter loop. A C<end> opcode would do it. OTOH we
  might need it to restore some context items. We could keep some return
  continuations around (in a free_list) and only update their context:
  see the C<updatecc> opcode.
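
  As a sketch (again with invented names), recycling return continuations
  and only refreshing their context would amount to:

    /* Hypothetical sketch only: keep return continuations on a free list
     * and, when reusing one, just update its resume point and context -
     * the idea behind C<updatecc>. */
    #include <stdlib.h>

    typedef struct RetCont {
        struct RetCont *next;      /* free-list link              */
        void           *resume;    /* where the caller continues  */
        void           *ctx;       /* caller context to restore   */
    } RetCont;

    static RetCont *cc_free_list = NULL;

    static RetCont *get_ret_cont(void *resume, void *ctx)
    {
        RetCont *cc = cc_free_list;
        if (cc)
            cc_free_list = cc->next;   /* reuse: no new allocation */
        else
            cc = malloc(sizeof *cc);
        cc->resume = resume;           /* the "updatecc" part      */
        cc->ctx    = ctx;
        return cc;
    }

    static void put_ret_cont(RetCont *cc)
    {
        cc->next = cc_free_list;
        cc_free_list = cc;
    }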

6) Reentering the run loop
  This currently needs 5 function calls:
  - runops pushes a new Parrot_exception
  - runops_ex is a currently needed ugly hack to allow intersegment
      branches (i.e. evaled code has a "goto main" inside)
  - runops_int handles resumable opcodes like C<trace>
  - runops_xxx does run-loop-specific setup, like JITting the code
    if it isn't yet JITted
  - finally, the run loop itself
  We can call into some inner runops if a method call doesn't need all
  this setup. We can also call a specialized runops wrapper that
  shortcuts this setup. It doesn't achieve much though - see below.
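
  A sketch of such a shortcut wrapper (all names hypothetical) - it jumps
  straight into the inner loop when the callee needs none of the extra
  setup:

    /* Hypothetical sketch only: skip the exception frame, the intersegment
     * branch hack and the resumable-opcode handling when the called sub is
     * known not to need them. */
    typedef struct Interp Interp;

    enum {
        NEEDS_EXCEPTION = 1 << 0,   /* pushes a Parrot_exception         */
        NEEDS_INTERSEG  = 1 << 1,   /* evaled "goto main" style branches */
        NEEDS_RESUMABLE = 1 << 2    /* C<trace> and friends              */
    };

    /* stand-ins for the full 5-call entry path and the bare run loop */
    static void runops_full(Interp *interp, void *pc)   { (void)interp; (void)pc; }
    static void runloop_inner(Interp *interp, void *pc) { (void)interp; (void)pc; }

    static void runops_method(Interp *interp, void *pc, int flags)
    {
        if (!(flags & (NEEDS_EXCEPTION | NEEDS_INTERSEG | NEEDS_RESUMABLE)))
            runloop_inner(interp, pc);   /* straight into the run loop  */
        else
            runops_full(interp, pc);     /* fall back to the full setup */
    }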

7) Leaving all these run loops
8) Return value handling, if any
9) Register frame restore

Done.

So when the above sequence is run for a new object, we additionally have
object construction.

10) _instantiate_object
  We've already discussed how to speed that up with a different object
  layout and by not using any aggregate PMC containers.
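
  For illustration only (not the agreed-upon layout), keeping the
  attributes in one inline array instead of aggregate PMC containers would
  make instantiation a single allocation:

    /* Hypothetical sketch only: a flatter object layout with the attributes
     * inline instead of in nested aggregate PMCs.  Assumes n_attrs >= 1. */
    #include <stdlib.h>
    #include <string.h>

    typedef struct PMC PMC;

    typedef struct {
        PMC *class_pmc;   /* the object's class                       */
        int  n_attrs;     /* attribute count, taken from the class    */
        PMC *attrs[1];    /* attributes inline (C89 "flexible" array) */
    } ParrotObject;

    static ParrotObject *instantiate_object(PMC *class_pmc, int n_attrs)
    {
        ParrotObject *o = malloc(sizeof *o + (n_attrs - 1) * sizeof(PMC *));
        o->class_pmc = class_pmc;
        o->n_attrs   = n_attrs;
        memset(o->attrs, 0, n_attrs * sizeof(PMC *));  /* all attrs start NULL */
        return o;
    }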

Finally, some current timing results (parrot -O3, Athlon 800, JIT core):

 Create 100,000 new PerlInts + 100,000 invokecc __init  0.24 s
 Create 100,000 new delegate PMCs and call __init       0.60 s
                same, call runops_int directly          0.57 s
 Create 100,000 new objects and call __init             1.00 s

Object instantiation accounts for 40% of the total time (1.00 s - 0.60 s
= 0.40 s of it). Let's start by optimizing the object layout first.

leo
