... I hope it is (inline) attached below.

leo
1) Introduction
Below [1] is a small test program, which basically shows the speed of
calling generators aka coroutines. But *what* I want to discuss isn't
restricted to calling coroutines. It's the same (more or less) with
calling any subroutine-like thingy, be it a method, an overridden
operator, or an internal method like array.sort(), if the "optimizer"
doesn't know what kind of method gets called, because e.g. the method
was passed in as a function argument.

So some timing results first:

  $ time python f.py a
  real    0m0.209s

  $ perl pie-thon.pl f.py > f.pir    # [2]
  $ time parrot --python -j -Oc f.pir a
  real    0m0.867s

So that's not really fast - or worse, 4 times slower - and yes, this is
already an optimized build of parrot.

So why is it slow? Here is a profile run:

  $ time parrot --python -p -Oc f.pir a

  OPERATION PROFILE

  CODE J OP FULL NAME           CALLS  TOTAL TIME   AVG T. ms
  ---- - ------------------- -------- ----------  ----------
    39   invokecc              100004   0.163775      0.0016
   804 j shift_p_p             100002   0.149701      0.0015
    -4   DOD_collect_PMC            8   0.108252     13.5315
   751 j new_p_ic              300011   0.086560      0.0003
   910   set_p_sc              100002   0.084887      0.0008
   965 j set_p_p_k             100001   0.074723      0.0007
   995   pushtopp              100002   0.072110      0.0007
    -3   DOD_collect_buffers        8   0.065581      8.1976

The DOD runs are all triggered from the C<pushtopp> opcode, which needs
new frame buffers for saving register frames in the coroutine. The
C<shift_p_p> opcode does another copy of 640 bytes (back and forth) to
preserve registers when calling the coroutine. (The C<invokecc> might
need some optimization, but isn't the real problem ...)

Now imagine that the main program (or the coroutine) is somewhat more
complex, so that we can't emit a C<pushtopp> that preserves just the
top half of the P registers:

  $ time parrot --python -p f.pir a

  OPERATION PROFILE

  CODE J OP FULL NAME           CALLS  TOTAL TIME   AVG T. ms
  ---- - ------------------- -------- ----------  ----------
  1002   savetop               100004   0.291580      0.0029
    -3   DOD_collect_buffers       26   0.164344      6.3209

That'll be the normal case for average programs that make use of all
Parrot register types - optimizations to use native types will have the
totally counter-productive effect of slowing down program execution. A
lot of time is spent in pushtopp (or savetop, or saveall) and the
equivalent pops. And in the presence of coroutines or continuations the
return continuation can't be reused; we have to allocate fresh frames
for each call.

Ok, how can we preserve registers, get continuations right, and run
faster?

2) The idea

The key idea is this change in the vtable prototypes:

  void* invoke(Interp*, PMC* sub, void* next, Interp **frame)

(The duplication of the Interp* argument is just to keep the current
PMC compiler and other utils, which assume this vtable argument
arrangement.)

3) So what is Interp** frame?

It's a new Interp* structure returned from the sub (or closure, or
continuation) when calling it the first time. This "frame" aka
interpreter has copies of some pointers of the original interpreter
(like arena_base) and has a separate register area. And the
"interpreter" keeps on running with that passed "frame".

So instead of saving all registers on each function or method call and
restoring them on return, we copy the interpreter once, on creation of
that sub or coroutine or method, and then reuse its registers whenever
we are inside that function.

4) CPS

The context gets a pointer to the calling frame. So a return via the
continuation restores that frame, i.e. the interpreter structure of
the caller.
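To make 2) - 4) concrete, here is a minimal C sketch of how such an
invoke could look. This is not Parrot source: except for the invoke
prototype above, all structure layouts and names (sub_invoke,
cont_invoke, the caller field) are invented for illustration.

  /* Sketch only - not Parrot source. Everything except the invoke
   * prototype is invented for illustration. */

  #include <stdlib.h>

  typedef struct Interp Interp;
  typedef struct PMC    PMC;

  struct Interp {
      char           registers[640]; /* this frame's own register area  */
      Interp        *caller;         /* context: frame to restore later */
      struct Arenas *arena_base;     /* shared with the creating interp */
  };

  struct PMC {                       /* stand-in for a Sub/Continuation */
      Interp *frame;                 /* created once, on first invoke   */
      void   *resume;                /* continuation: where to resume   */
  };

  /* Entering a sub: no register push/pop, just switch frames. */
  void* sub_invoke(Interp *interp, PMC *sub, void *next, Interp **frame)
  {
      if (!sub->frame) {                          /* first call ever    */
          sub->frame = calloc(1, sizeof(Interp)); /* fresh registers    */
          sub->frame->arena_base = interp->arena_base;
      }
      sub->frame->caller = interp;   /* CPS: the context points back    */
      *frame = sub->frame;           /* interpreter runs on with this   */
      return next;                   /* first opcode of the sub         */
  }

  /* Invoking the return continuation: restore the caller's frame. */
  void* cont_invoke(Interp *interp, PMC *cont, void *next, Interp **frame)
  {
      (void)next;
      *frame = interp->caller;       /* back to the caller's registers  */
      return cont->resume;           /* resume address saved at call    */
  }

A plain call/return pair then costs a few pointer assignments instead
of copying up to 640 bytes in each direction.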
5) Recursion

On each level that is deeper than the current one, a new frame is
created with a fresh set of registers. That's not cheaper than now - we
save registers on each deeper call - but it's not more expensive
either.

6) Argument/return value passing

Arguments and return values are copied from the caller to the sub or
vice versa. That's for sure cheaper than copying up to 640 bytes -
twice.

7) Lexicals

Lex pads are gone with that approach. Each frame has its lexical store
in the context, plus a pointer to the caller *if* it's a closure.

8) DOD

We just have to mark the register frames of the chain of currently
active frames (plus of course the other stuff we are marking now). We
currently have to mark the register backing stacks that hold the
registers, so that's basically the same amount of work. But just
marking the chain of active frames is much simpler (a sketch of such a
mark walk is attached after the footnotes).

9) JIT, predereferencing

That'll need almost no change. All addresses are still relative to the
interpreter; just the "interpreter" sometimes changes, which needs one
reinitialization on function entry. I'm thinking of making each
function a PackFile code segment, so that the current scheme of
JITting the code, if it isn't JITted yet, still works. JITting or
predereferencing is then always per subroutine.

10) Implementation thoughts

The interpreter (aka Frame) structure could look something like this:

  struct interpreter {
      [ registers ]
      [ context ]
      struct Stash *globals;
      struct Arenas *arena_base;
      struct more_state *more_stuff;
  };

Some heavily used things might move from "more_stuff" into the outer
level to save one indirection. But that's already optimization work :)

11) Finally

That all sounds so simple - too simple - so I must have missed
something. Comments welcome.

leo

[1]
  $ cat f.py
  def foo():
      i = 0
      while (1):
          if i > 100000:
              return
          yield "abcdefghi"[i % 5]
          i = i + 1

  def main():
      i = foo()
      for y in i:
          pass
      print y

  if __name__ == "__main__":
      main()

[2] in languages/python with a link to ../../parrot
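P.S.: Here is the mark-walk sketch referenced in 8). Again
illustration only: the frame layout is the simplified one from the
sketch above, and the mark entry point stands in for something like
Parrot's pobject_lives().

  /* Sketch only: marking the chain of active frames during DOD. */

  #define NUM_P_REGS 32

  typedef struct PMC PMC;

  typedef struct Interp {
      PMC *pmc_reg[NUM_P_REGS];   /* P registers of this frame       */
      struct Interp *caller;      /* next frame up the call chain    */
  } Interp;

  /* stand-in for the real mark entry point */
  extern void pobject_lives(Interp *, PMC *);

  /* Walk from the innermost frame back to main, marking live PMCs. */
  void mark_active_frames(Interp *current)
  {
      Interp *f;
      int     i;
      for (f = current; f; f = f->caller)
          for (i = 0; i < NUM_P_REGS; i++)
              if (f->pmc_reg[i])
                  pobject_lives(f, f->pmc_reg[i]);
  }

The set of live registers is exactly the frames reachable from the
current one - no backing stacks to traverse.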