I've committed a couple of minor optimizations which speed up Rakudo and Perl OO in general by about 35%. There may be a few more lurking, but I keep running into three spots which dominate most of the other optimization strategies I might pursue.
1) The garbage collector is algorithmically inefficient. There are various potential optimization strategies here that don't require walking every allocated header in the system multiple times (yes, it's that bad in at least two places), but I suspect the real solution is to replace the current GC, at minimum with the one specified in the new GC PDD. The two worst spots are turning off the live flag (and destroying everything with a custom destroy flag) and dividing newly-allocated memory pools into headers, manually adding each header to a linked list of free objects.

Flag marking would be much cheaper (and likely cache-friendlier, though I haven't done the math to be sure) if we used some sort of card-marking scheme, where we can scan 8 or 16 or 32 or 64 bits of a bitmap at a time to see whether we need to do anything for specific objects, rather than walking through all of the memory pools in the system and checking flags one header at a time. It might also make our PObjs a couple of bytes smaller, which puts that much less pressure on the GC and means we have to run a full GC that much less frequently. I really recommend that the new GC use this approach.

The free object list is the reason that compacting/copying collectors are popular: all you have to do to find the next free object is bump the pointer by sizeof (header) and check that it's still within the bounds of the pool. We could ameliorate the cost somewhat with a hybrid scheme of pointer bumping followed by free-list checking, which would remove the cost of dividing fresh pools into headers and building the linked list, at the price of a little more complexity in the allocator. I'm not convinced that's an appropriate tradeoff for the 10% improvement it's likely to give. A copying or compacting scheme would give us this benefit almost for free (where the only cost is replacing the entire garbage collector and rewriting a fair chunk of the internals of the system).
With all that said, one good optimization is to allocate fewer PObjs, so if there's a hotspot in either PIR or C where you create a new PMC or STRING you don't always need (or could cache), fixing it is almost always a minor win. (It's almost always worth doing only in hotspots, though, so we need to be able to identify those in PIR... soon.)

2) PIR vtable overrides are very naive. See src/oo.c, specifically Parrot_oo_find_vtable_override. Currently there are very few PIR-level overrides of vtable entries, but the "Is there an override?" scheme here walks through all of the parents of a Class every time something wants to check for a vtable override -- and both the Class and Object PMCs check for overrides, namely get_class and [gs]et_attr_str. I made a quick and dirty optimization to this function to check only the current class, and it gave me a 12.5% speed improvement over the full check when running my NQP/Rakudo benchmark. 'make coretest' passed as well. The system is dynamic, such that we can add and remove parents as well as methods and vtable entries at runtime, but we should still be able to cache this information somehow, so that we don't continually look for things that don't exist, probably won't exist, and certainly haven't sprung into existence since the most recent check. I don't have a complete answer for this yet, but we should be able to cache more intelligently without removing the essential dynamism from the system.

3) PCC argument processing is slow. I've looked over Parrot_pass_args and Parrot_process_args several times in the past few months, and I didn't see any obvious speedups or tricks. However, we do spend a lot of time shuffling data back and forth, and something (instinct, blind desire, lower blood sugar than normal) suggests that we already *know* some of this information.
On that argument-processing point: at the point of a call we know whether there are slurpy or named arguments, and whether the invoked function wants to bind slurpy or named parameters. It seems like we could have a fast path for that common operation... somehow. Suggestions, analysis, and patches are all very much welcome. -- c