... I hope is (inline) attached below
leo
1) Introduction

Below [1] is a small test program, which basically shows the speed of
calling generators aka coroutines. But *what* I want to discuss isn't
restricted to calling coroutines. It's the same (more or less) for
calling any subroutine-like thingy, be it a method, an overridden
operator, or an internal method like array.sort(), whenever the
optimizer doesn't know what kind of method gets called, e.g. because
the method was passed in as a function argument.

So some timing results first:

$ time python f.py
a

real    0m0.209s

$ perl pie-thon.pl f.py > f.pir                 # [2]
$ time parrot --python -j -Oc f.pir
a

real    0m0.867s

So that's not really fast - it's roughly 4 times slower - and yes, this
is already an optimized build of parrot.

So why is it slow? Here is a profile run:

$ time parrot --python -p -Oc f.pir
a

                   OPERATION PROFILE

 CODE J OP FULL NAME            CALLS  TOTAL TIME   AVG T. ms
 ---- - -----------------    --------  ----------  ----------
   39   invokecc               100004    0.163775      0.0016
  804 j shift_p_p              100002    0.149701      0.0015
   -4   DOD_collect_PMC             8    0.108252     13.5315
  751 j new_p_ic               300011    0.086560      0.0003
  910   set_p_sc               100002    0.084887      0.0008
  965 j set_p_p_k              100001    0.074723      0.0007
  995   pushtopp               100002    0.072110      0.0007
   -3   DOD_collect_buffers         8    0.065581      8.1976

The DOD runs are all triggered from the C<pushtopp> opcode, which needs
new frame buffers for saving register frames in the coroutine. The
C<shift_p_p> opcode does another copy of 640 bytes (back and forth) to
preserve registers when calling the coroutine.
(The C<invokecc> might need some optimization, but it isn't the real
problem ...)

Now imagine the main program (or the coroutine) is somewhat more
complex, and we can't emit the C<pushtopp> that preserves only the top
half of the P registers:

$ time parrot --python -p f.pir
a

                   OPERATION PROFILE

 CODE J OP FULL NAME            CALLS  TOTAL TIME   AVG T. ms
 ---- - -----------------    --------  ----------  ----------
 1002   savetop                100004    0.291580      0.0029
   -3   DOD_collect_buffers        26    0.164344      6.3209

That'll be the normal case for average programs that make use of all
Parrot register types - optimizations that use native types will then
have the totally counter-productive effect of slowing down program
execution.

A lot of time is spent in C<pushtopp> (or C<savetop>, or C<saveall>)
and the equivalent pops. And in the presence of coroutines or
continuations the return continuation can't be reused, so we have to
allocate fresh frames for each call.

Ok, how can we preserve registers, get continuations right, and run
faster?

2) The idea

The key idea is this change in vtable prototypes:

  void* invoke(Interp*, PMC* sub, void* next, Interp **frame)

(The duplication of the Interp* argument is just to keep the current
PMC compiler and other utils working, as they assume this vtable
argument arrangement.)

3) So what is Interp** frame?

It's a new Interp* structure returned from the sub (or closure, or
continuation) when calling it the first time. This "frame" aka
interpreter has copies of some pointers of the original interpreter
(like arena_base) and a separate register area. And the "interpreter"
keeps on running with that passed-back "frame".

So instead of saving all registers on each function or method call and
restoring them on return, we copy the interpreter once, at creation of
that sub or coroutine or method, and then reuse those registers
whenever we are inside that function.
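
To make that concrete, here is a minimal C sketch of the idea. All
names (Frame, Sub, frame_for_sub, NUM_PREGS) are made up for
illustration, they are not existing Parrot code:

  /* Sketch only: each sub carries its own "frame" (a stripped-down
   * interpreter with a private register set). It is allocated once,
   * on the first invocation, and reused afterwards - no push/pop of
   * register frames around every call. */

  #include <stdlib.h>

  #define NUM_PREGS 32                 /* illustrative register count */

  typedef struct Frame {
      void         *p_regs[NUM_PREGS]; /* this frame's PMC registers */
      struct Frame *caller;            /* set at call time, used on return */
      /* ... copies of globals, arena_base, ... from the main interpreter */
  } Frame;

  typedef struct Sub {
      Frame *frame;                    /* created lazily, then reused */
      /* ... entry address, etc. */
  } Sub;

  /* invoke-time helper: hand back the frame the interpreter should
   * continue running with */
  static Frame *frame_for_sub(Sub *sub, Frame *caller)
  {
      if (!sub->frame)                 /* first call: create the frame */
          sub->frame = calloc(1, sizeof(Frame));
      sub->frame->caller = caller;     /* remember where to return to */
      return sub->frame;
  }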

4) CPS

The context gets a pointer to the calling frame. So a return via the
continuation restores that frame, i.e. the interpreter structure of
the caller.
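
In the same made-up terms as the sketch above, the return path is then
just a pointer switch:

  /* Sketch only: invoking the return continuation restores the
   * caller's registers by switching the current frame pointer back -
   * no copying of register contents. */

  typedef struct Frame {
      struct Frame *caller;      /* filled in when the sub was invoked */
      /* ... registers, context ... */
  } Frame;

  static Frame *cur_frame;       /* the frame the run loop currently uses */

  static void return_via_continuation(void)
  {
      cur_frame = cur_frame->caller;   /* back to the caller's interpreter */
  }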

5) Recursion

On each level that is deeper than the current one, a new frame is
created with a fresh set of registers. It's not cheaper than now - we
still save registers on each deeper call - but it's not more expensive
either.
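
One purely illustrative way to handle that, continuing the sketch from
above (and ignoring captured continuations, which would have to keep a
frame alive):

  /* Sketch only: if the sub's cached frame is already live on the
   * call chain (i.e. we are recursing), allocate a fresh frame with
   * fresh registers for the deeper level and drop it on return. */

  #include <stdlib.h>

  typedef struct Frame {
      struct Frame *caller;
      int           in_use;            /* live on the current call chain? */
      /* ... registers ... */
  } Frame;

  typedef struct Sub {
      Frame *frame;                    /* the frame reused by plain calls */
  } Sub;

  static Frame *enter_sub(Sub *sub, Frame *caller)
  {
      Frame *f;
      if (sub->frame && !sub->frame->in_use)
          f = sub->frame;              /* common case: reuse the cached frame */
      else {
          f = calloc(1, sizeof(Frame)); /* first call or a recursive call */
          if (!sub->frame)
              sub->frame = f;          /* cache the outermost frame */
      }
      f->caller = caller;
      f->in_use = 1;
      return f;
  }

  static void leave_sub(Sub *sub, Frame *f)
  {
      f->in_use = 0;
      if (f != sub->frame)             /* a deeper recursion level: its */
          free(f);                     /* frame isn't cached, drop it   */
  }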

6) Argument/return value passing

Arguments and return values are copied from the caller to the sub or
vice versa. That's for sure cheaper than copying up to 640 bytes -
twice.
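
For illustration (register numbers and names are invented, they are
not the actual calling conventions):

  /* Sketch only: only the registers that actually carry arguments are
   * copied between the caller's and the callee's frame - a few pointer
   * copies instead of saving/restoring whole 640 byte register frames. */

  #define NUM_PREGS 32

  typedef struct Frame {
      void *p_regs[NUM_PREGS];
      /* ... */
  } Frame;

  /* copy n PMC arguments, e.g. P5 .. P5+n-1, from caller to callee;
   * return values would go the other way round */
  static void pass_args(const Frame *caller, Frame *callee, int n)
  {
      int i;
      for (i = 0; i < n; i++)
          callee->p_regs[5 + i] = caller->p_regs[5 + i];
  }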

7) Lexicals

Lex pads are gone with that approach. Each frame has its lexical store
in the context, plus a pointer to the caller *if* it's a closure.
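
A sketch of such a lexical lookup, with toy data structures (a
fixed-size store instead of a hash, made-up names):

  /* Sketch only: the lexical store lives directly in the frame's
   * context; a closure's store additionally points at the store it
   * was created under, so lookup just walks that chain - no separate
   * lex pad stack. */

  #include <string.h>

  typedef struct LexStore {
      const char      *names[16];     /* toy fixed-size store; the real */
      void            *values[16];    /* thing would be a hash          */
      int              used;
      struct LexStore *outer;         /* only set for closures          */
  } LexStore;

  typedef struct Context {
      LexStore lex;                   /* each frame owns its lexicals   */
      /* ... caller, current sub, ... */
  } Context;

  static void *find_lex(const Context *ctx, const char *name)
  {
      const LexStore *store;
      for (store = &ctx->lex; store; store = store->outer) {
          int i;
          for (i = 0; i < store->used; i++)
              if (strcmp(store->names[i], name) == 0)
                  return store->values[i];
      }
      return NULL;                    /* not found in any enclosing store */
  }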

8) DOD

We just have to mark the register frames of the chain of currently
active frames (plus of course the other stuff we are marking now). We
currently have to mark the register backing stacks that hold the saved
registers, so that's basically the same amount of work. But just
marking the chain of active frames is much simpler.
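
As a sketch (made-up names; mark_pmc just stands in for the real
"this object is alive" routine):

  /* Sketch only: a DOD run marks the PMC registers of the currently
   * active frames by walking the caller chain - there are no register
   * backing stacks left to traverse. */

  #define NUM_PREGS 32

  typedef struct Frame {
      void         *p_regs[NUM_PREGS];
      struct Frame *caller;
  } Frame;

  /* stand-in for the real "mark this PMC as live" function */
  extern void mark_pmc(void *pmc);

  static void mark_active_frames(Frame *cur_frame)
  {
      Frame *f;
      for (f = cur_frame; f; f = f->caller) {
          int i;
          for (i = 0; i < NUM_PREGS; i++)
              if (f->p_regs[i])
                  mark_pmc(f->p_regs[i]);
      }
  }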

9) JIT, predereferencing

That'll need almost no change. All addresses are still relative to the
interpreter, just the "interpreter" sometimes changes, which needs one
reinitialization on function entry.

I'm thinking of making each function a PackFile code segment, so that
the current scheme of JITting the code if it isn't JITted yet still
works. JITting or predereferencing then always happens per subroutine.

10) Implementation thoughts:

The interpreter (aka Frame) structure could look something like this:

  struct interpreter {
      /* [ registers ] - the register set lives directly in here */
      /* [ context ]   - caller, lexicals, current sub, ...      */
      struct Stash *globals;
      struct Arenas *arena_base;
      struct more_state *more_stuff;
  };

Some heavily used things might go from "more_stuff" into the outer
level to save one indirection. That's already optimization work :)

11) Finally

That all sounds so simple - too simple - so I must have missed
something.

Comments welcome.

leo

[1]

$cat f.py
def foo():
    i=0
    while (1):
        if i > 100000:
            return
        yield "abcdefghi"[i%5]
        i = i + 1

def main():
    i = foo()
    for y in i:
        pass
    print y

if __name__ == "__main__":
    main()

[2] in languages/python with a link to ../../parrot
