Mike Lambert wrote:

> Rather impressive. Except that it makes me look bad. :)

You have to follow orders - I don't any more! As I said before,
I am now doing this for fun - and when running on a slow machine,
speed becomes very important. I have been very pleased to see
that a lot of my code has made it into the official codebase, and
that you have been extending the concepts.

> Your solution, (ignoring the extra cycle counter byte for now), cannot
> handle vtable methods implemented in Parrot. [...]

I agree that the cycle counter approach is not going to handle all
possibilities - however, it was originally dismissed because:
<< start quote from Dan
Unfortunately I don't. I'm willing to live with a performance
degradation in those spots where it's truly necessary, but only
there. The performance hit has to be localized to those places where
infant mortality is actually a problem. Those spots are currently
reasonably rare, and I'm not sure that the hoops we're building to
jump through (and I'm as guilty as anyone) are worth it.
>> end quote

I'm sure you'll agree that the stack-walking code suffers from the
same problem - we take a performance hit for every DOD run.
As Nick suggested, perhaps we need multiple methodologies.
The cycle counter approach really has only one merit - transparency -
and that is why I proposed it originally: at the time we had multiple
developers producing code with neonate bugs.
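For concreteness, a minimal sketch of what I mean (the names here are
invented for illustration, not the actual Parrot identifiers): every
buffer header carries one extra byte recording the DOD cycle in which
it was allocated, and the collector simply treats anything born in the
current cycle as live.

#include <stddef.h>

typedef unsigned char cycle_t;

typedef struct Buffer {
    void   *bufstart;
    size_t  buflen;
    cycle_t alloc_cycle;    /* the extra byte: DOD cycle at allocation */
} Buffer;

typedef struct Interp {
    cycle_t dod_cycle;      /* incremented at the start of each DOD run */
} Interp;

/* Stamp each buffer with the cycle in which it was born. */
void buffer_init(Interp *interp, Buffer *b)
{
    b->alloc_cycle = interp->dod_cycle;
}

/* A buffer born in the current cycle may be reachable only from the
 * C stack, so the DOD treats it as live without any stack walk. */
int is_neonate(const Interp *interp, const Buffer *b)
{
    return b->alloc_cycle == interp->dod_cycle;
}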

> 2) Currently, we use linked list of buffer headers for freeing and
> allocating headers. I'm not sure what you mean by saying that they are
> sorted by bufstart? What does this buy over the current system?

I am now keeping a linked list of *allocated* buffer headers, headed
up by the Memory_Block structure, and linked in ascending order of
their bufstart values (this is the order in which they are allocated
anyway). This is primarily used by COW-3: when a shared header is
created, it is inserted into the linked list at the appropriate place.
compact_pool then runs through the Memory_Block list and finds each
header in sequence. If the header is dead (this is one of the things
I have not re-implemented yet), it is freed and no copy takes place;
if the header overlaps the previous one (i.e. COW), the header is
re-addressed without a copy.
This fixes the problem of a large buffer being kept around only
because a small substring is still alive. Unfortunately, it is very
expensive in terms of memory usage, as it uses a doubly-linked list
and a reference back to the pool (although the latter could probably
be eliminated).
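To make the compaction walk concrete, here is roughly its shape; all
names are invented for illustration, and freeing of dead headers back
to the pool is not shown:

#include <stddef.h>
#include <string.h>

typedef struct BufHeader {
    void             *bufstart;
    size_t            buflen;
    int               live;     /* set by the DOD mark phase */
    struct BufHeader *next;     /* sorted ascending by bufstart */
    struct BufHeader *prev;     /* the doubly-linked part, used when
                                   inserting shared headers */
} BufHeader;

void compact_pool(BufHeader *head, char *dest)
{
    char      *prev_start = NULL;  /* old location of last copied buffer */
    char      *prev_end   = NULL;
    char      *prev_dest  = NULL;  /* its new location */
    BufHeader *h;

    for (h = head; h != NULL; h = h->next) {
        char *start = (char *)h->bufstart;

        if (!h->live)
            continue;   /* dead: header freed, data never copied */

        if (prev_start != NULL && start >= prev_start && start < prev_end) {
            /* COW overlap: these bytes already moved with the
             * previous header, so just re-address - no second copy */
            h->bufstart = prev_dest + (start - prev_start);
        }
        else {
            memcpy(dest, start, h->buflen);
            prev_start  = start;
            prev_end    = start + h->buflen;
            prev_dest   = dest;
            h->bufstart = dest;
            dest       += h->buflen;
        }
    }
}

Note that when a large parent buffer is dead but a small COW substring
is live, the substring fails the overlap test and is copied as its own
small buffer - which is exactly how the large parent gets released.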

> 4) Isn't this really the same thing as item 3? I'm basing my knowledge off
> your old COW patches. Has additional work been done on the string function
> integration since then, or do #3 and #4 both come from those patches?

The logic described above is what I meant by #3, i.e. the GC code is
equipped to handle COW of any buffer type, not only strings. By #4 I
meant that actual COW functionality is implemented in the next layer
for strings only.

> I'm not sure how much of the new code you've merged with. Which of the new
> files are you planning to integrate/merge with, and which have you thrown
> out in favor of older versions? I'm specifically referring to any of
> resources/dod/smallobject/headers.c.

I have basically cut-and-pasted my changes into your new files, because
otherwise I was going to be arrested for cruelty to CVS. My last diff
against current CVS produced a 24K patch file; I don't believe that
diff is quite accurate any more, and I have since installed Steve's
patch to test with GC_DEBUG, so I will have to do a bit of work to sort
it out. Once I have a working patch, I will email it to you directly,
in case you can find anything useful in it.

For interest, the other changes in my old version that I am still
working on integrating are below (rough sketches of both follow the
list):
1) Allocate the memory pool in fixed-size chunks, even during
compaction. In other words, instead of allocating a new block of
memory equal to the current size used, allocate a chunk (of say 16K
or whatever), then allocate another chunk when that one is full.
During compaction, once a chunk has been emptied, it is returned to
a free-chunk list, and can then be reallocated. This reduces the peak
memory consumption, and allows the compaction process to be
suspended between chunks. I believe it would also make
generational collection easier, though I haven't tried that yet.
The downside is obviously wasted space at the end of each chunk,
but there are ways to minimise that: chunk sizes are increased
dynamically as memory usage grows, and to accommodate large
allocation requests.
2) Postpone freeing of buffers. At present, the DOD run first marks
everything it can find, then frees what it couldn't. By recording the
cycle counter in the buffer header when a buffer is marked live, the
free_unused_buffers runs can be delayed until we actually need free
buffers in a specific resource (sorry, small object) pool. In
conjunction with the code I referred to above for freeing buffer
headers during a compaction run, this reduces the frequency of
free_unused_buffers runs. It requires that buffer_lives be passed
the interpreter, so that it can access the cycle counter.
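A rough sketch of 1), with invented names - the dynamic chunk-size
growth and error handling are omitted:

#include <stdlib.h>

#define CHUNK_SIZE (16 * 1024)    /* "16K or whatever" */

typedef struct Chunk {
    struct Chunk *next;
    size_t        used;
    char          data[CHUNK_SIZE];
} Chunk;

typedef struct Pool {
    Chunk *current;       /* head of the chain of active chunks */
    Chunk *free_chunks;   /* chunks emptied during compaction */
} Pool;

static Chunk *get_chunk(Pool *pool)
{
    Chunk *c = pool->free_chunks;
    if (c != NULL)
        pool->free_chunks = c->next;   /* reuse an emptied chunk */
    else
        c = malloc(sizeof(Chunk));     /* or grow by one chunk only */
    c->used = 0;
    c->next = pool->current;           /* keep active chunks chained */
    pool->current = c;
    return c;
}

void *pool_alloc(Pool *pool, size_t size)
{
    Chunk *c = pool->current;
    void  *p;
    if (c == NULL || c->used + size > CHUNK_SIZE)
        c = get_chunk(pool);   /* tail waste is the downside noted above */
    p = c->data + c->used;
    c->used += size;
    return p;
}

Since compaction can return a chunk to the free list as soon as it has
been emptied, peak usage during compaction grows by one chunk rather
than by the whole pool size.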
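And a sketch of 2) - buffer_lives stamps the header with the current
cycle instead of setting a flag, and the sweep then runs lazily, per
pool; again, the names are invented:

typedef unsigned int cycle_t;

typedef struct Interp {
    cycle_t dod_cycle;            /* incremented once per DOD run */
} Interp;

typedef struct BufHeader {
    cycle_t           last_live_cycle;
    struct BufHeader *next;
} BufHeader;

typedef struct Pool {
    BufHeader *allocated;         /* all headers handed out */
    BufHeader *free_list;
} Pool;

/* Called from the DOD mark phase; it now needs the interpreter,
 * purely to reach the cycle counter. */
void buffer_lives(Interp *interp, BufHeader *b)
{
    b->last_live_cycle = interp->dod_cycle;
}

/* Run only when this pool actually needs free headers, not after
 * every DOD run. */
void free_unused_buffers(Interp *interp, Pool *pool)
{
    BufHeader **p = &pool->allocated;
    while (*p != NULL) {
        BufHeader *b = *p;
        if (b->last_live_cycle != interp->dod_cycle) {
            *p = b->next;              /* not marked in the last DOD */
            b->next = pool->free_list; /* so recycle the header */
            pool->free_list = b;
        }
        else
            p = &b->next;
    }
}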

The other changes I made were to the string subsystem; I probably
won't do them again, because they change too many files (a rough
sketch follows the list):
a) replace chartype and encoding by a single charset entry.
This removes one pointer field from the string header and simplifies
all the string functions, which are forever checking both fields to
determine whether transcoding is required. I do not believe that the
two fields are orthogonal, and therefore the number of charsets would
be less than #chartypes * #encodings.
b) some alterations to the single vtable thus created, in
particular the addition of a find_substring method.
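A rough sketch of the kind of merge I mean in a), with the addition
from b); these are invented names, not Parrot's actual structures:

#include <stddef.h>

typedef struct String  String;
typedef struct Charset Charset;

struct Charset {
    const char *name;    /* e.g. "usascii" */

    /* one vtable where chartype and encoding each had one before */
    size_t (*characters)(const String *s);

    /* the addition mentioned in b) */
    String *(*find_substring)(const String *haystack, const String *needle);
};

struct String {
    void          *bufstart;
    size_t         buflen;
    const Charset *charset;   /* one pointer replacing both fields */
};

With a single field, a string function can decide whether transcoding
is needed with one comparison (s1->charset == s2->charset) instead of
checking chartype and encoding separately.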

--
Peter Gibbs
EmKel Systems
----- Original Message -----
To: "Peter Gibbs" <[EMAIL PROTECTED]>
Cc: "perl6-internals" <[EMAIL PROTECTED]>
Sent: 16 August 2002 07:36
Subject: Re: [INFO] The first pirate parrot takes to the air


> > For purely academic purposes, I have re-synchronised some of my
> > forbidden code with the latest CVS version of Parrot. All tests pass
> > without gc debug; and with gc_debug plus Steve Fink's patch.
> > Benchmarks follow, on a 166MHz Pentium running linux 2.2.18.
> >
> >                                      Parrot      African Grey
> > life (5000 generations)           172 seconds      81 seconds
> > reverse <core_ops.c >/dev/null    193 seconds     130 seconds
> > hanoi 14 >/dev/null                51 seconds      37 seconds
>
>
> > The differences between the two versions are:
> > 1) Use of the interpreter cycle-counter instead of stack walking.
> > 2) Linked lists of buffer headers sorted by bufstart
> > 3) COW-supporting code in GC (for all buffer objects)
> > 4) Implementation of COW for string_copy and string_substr
>
> 1) Yeah, the approach of cycle-counter is a nice one. I also had a similar
> solution involving neonate flag usage, somewhere in the archives. Both
> have *significant* speed advantages versus the current codebase's stack
> walking.
>
> I tried to convince Dan of their merit, but they failed for various reasons:
>
> Your solution, (ignoring the extra cycle counter byte for now), cannot
> handle vtable methods implemented in Parrot. The current system to
> implement this involves the interpreter recursively calling runops_core to
> handle the vtable method. If you increment cycles on the inner loop, you
> risk pre-collection of stuff on the stack of the vtable method calling
> stuff.  If you don't increment cycles, you prevent any of the memory
> allocated inside of this vtable method from ever being collected during
> the method execution...bad stuff when your vtable methods are multiplying
> gigantic matrices or somesuch.
>
> My neonate buffers solution fails only in the presence of longjmp.
>
> Granted, we don't do any of this yet, so these solutions will mop the
> floor with my current stackwalk code, and pass tests to boot. But it's the
> planned introduction of these requirements which are the reason for making
> these solutions 'forbidden'.
>
> One of Nick's solutions was to fallback on stack-walking to handle the
> cases where our faster solutions fail. I can definitely see that working
> with neonate buffers to handle the fixups needed after a longjmp call. But
> it doesn't seem as attractive in the presence of your solution, for which
> it would require stackwalking for all re-entrant runops calls. Do you have
> another solution in mind for handling re-entrant runops calls?
>
> As far as the extra byte in the buffer, I don't mind that one at all.
> There are a lot of restrictions on the GC code in the interest of making
> stuff "lightweight". Unfortuantely, GC code takes a significant portion of
> the execution time in any realistic application. Hopefully we can convince
> Dan to allow extra fields in the buffers in the interest of speed, but I
> don't think we can reduce parrot/perl6's feature set in the interest of
> speed...might as well use C if that's what you want. :)
>
>
> 3) Definitely a good one. I've been trying to merge your original COW
> patch into my code here. Without GC_DEBUG, it fails one test. With
> GC_DEBUG, it fails the traditional set plus that one test. Unfortunately,
> the test case is rather large; I haven't been able to narrow down the
> problem further, or I'd have committed it.
>
>
> > Some of the changes I made before the memory management
> > code was totally reorganised have not yet been re-integrated.
> > My last version prior to that reorganisation ran 5000 lives in
> > 61 seconds, and I hope to get back to somewhere close to
> > that again.
>
>
> Regardless of whether or not it goes in, I'd be interested in seeing a
> patch. I can work on integrating a lot of your non-forbidden code into the
> current codebase.
>
> Thanks for spending the time to generate these numbers...they're a nice
> eye-opener on what can be done without the current restrictions. Hopefully
> they'll allow us to reconsider each restriction in the context of
> the speed of our GC.
>
> Mike Lambert
