I've committed a couple of minor optimizations which speed up Rakudo and Perl OO in general by about 35%. There may be a few more lurking, but I keep running into three spots which dominate most of the other optimization strategies I might pursue.
1) The garbage collector is algorithmically inefficient. There are various potential optimization strategies here that don't require walking every allocated header in the system multiple times (yes, it's that bad in at least two places), but I suspect the real solution is to replace the current GC, at minimum with the one specified in the new GC PDD. The two worst spots are turning off the live flag (and destroying everything with a custom destroy flag) and dividing newly-allocated memory pools into headers, manually adding each header to a linked list of free objects.

Flag marking would be much cheaper (and likely cache-friendlier, though I haven't done the math to be sure) if we used some sort of card-marking scheme, where we can scan 8 or 16 or 32 or 64 bits of a bitmap at a time to see whether we need to do anything for specific objects, rather than walking through all of the memory pools in the system and checking flags one header at a time. It might also make our PObjs a couple of bytes smaller, which puts that much less pressure on the GC and means we have to run a full GC that much less frequently. I really recommend that the new GC use this approach.

The free object list is the reason that compacting/copying collectors are popular: all you have to do to find the next free object is bump the pointer by sizeof (header) and check that it's still within the bounds of the pool. We could ameliorate the cost somewhat with a hybrid scheme of pointer bumping followed by free-list checking, which would remove the cost of dividing fresh pools into headers and building the linked list, at the price of a little more complexity in the allocator. I'm not convinced that's an appropriate tradeoff for the 10% improvement it's likely to give. A copying or compacting scheme would give us this benefit almost for free (where the only cost is replacing the entire garbage collector and rewriting a fair chunk of the internals of the system).
With all that said, one good optimization is to allocate fewer PObjs, so if there's a hotspot in either PIR or C where you create a new PMC or STRING you don't always need (or could cache), fixing it is almost always a minor win. (It's almost always worth doing only in hotspots, though, so we need to be able to identify those in PIR... soon.)

2) PIR vtable overrides are very naive. See src/oo.c, specifically Parrot_oo_find_vtable_override. Currently there are very few PIR-level overrides of vtable entries, but the "Is there an override?" scheme here walks through all of the parents of a Class every time something wants to check for a vtable override -- and both the Class and Object PMCs check for overrides, namely get_class and [gs]et_attr_str. I made a quick and dirty optimization to this function to check only the current class, and it gave me a 12.5% speed improvement over the full check when running my NQP/Rakudo benchmark. 'make coretest' passed as well. The system is dynamic, such that we can add and remove parents as well as methods and vtable entries at runtime, but we should still be able to cache this information somehow, so that we don't continually look for things that don't exist, probably won't exist, and certainly haven't sprung into existence since the most recent check. I don't have a complete answer for this yet, but we should be able to cache more intelligently without removing the essential dynamism from the system.

3) PCC argument processing is slow. I've looked over Parrot_pass_args and Parrot_process_args several times in the past few months, and I didn't see any obvious speedups or tricks. However, we do spend a lot of time shuffling data back and forth, and something (instinct, blind desire, lower blood sugar than normal) suggests that we already *know* some of this information.
On that argument-processing point: at the point of a call we know whether there are slurpy or named arguments, and whether the invoked function wants to bind slurpy or named parameters. It seems like we could have a fast path for that common operation... somehow. Suggestions, analysis, and patches are all very much welcome. -- c