In tracing down the next crash bug using GC_DEBUG, I found that the
following code in new_stack is unsafe:

    stack->buffer = new_buffer_header(interpreter);
    Parrot_allocate(interpreter, stack->buffer,
                    sizeof(Stack_Entry_t) * STACK_CHUNK_DEPTH);

A GC can be triggered by Parrot_allocate that frees the newly created
buffer. This shows up when creating interpreter->ctx.pad_stack.

This is really the same problem as the hash->buffer bug. The system
stack walker will find the stack variable, but won't know how to look
inside it to find the buffer.

I can think of the following ways of fixing this (these are all
rehashes of things we've used or discussed in the past):

1. Also putting stack->buffer into a local variable and praying that
   it doesn't get optimized away
2. Disabling GC during the Parrot_allocate() call
3. Passing in an extra OUT parameter to new_stack so we can anchor the
   stack before the Parrot_allocate() call.
4. Tagging the newly created number with a generation counter so it
   won't get freed until after the next opcode is run (in this case,
   that'll be the first opcode). (This requires implementing Peter
   Gibbs' (?) idea of not freeing things created in the current generation.)
5. Stuffing the new buffer into a temporary holding area that is
   added to the root set.
6. Setting a neonate flag on the buffer that'll get automatically get
   cleared at an appropriate time.
7. Setting an immune flag on the buffer and manually clearing it after
   anchoring the stack to the root set.
8. Wrapping the stack in a PMC (so the stack struct would go into a
   ->data buffer and would be traversed with a custom mark routine) and
   being very, very careful about the order of initialization.

Any votes?

I have also come to the conclusion that tracking down these memory
bugs is way too difficult right now. I tracked the above problem back
from a seg fault, which was resulting from a garbage value in the byte
code stream, which was triggered by adding a PMC to the free list
until I added a bunch of debugging code. With the debugging code, I
could see that the dead userargv PMC (that wasn't really dead because
it was on the system stack even though it would never get used again)
pointed to a PMC with a bogus type, but also the memory corruption was
now triggered by a stack push where the stack and some PMC incorrectly
shared a buffer.

Or, in summary: these kinds of problems result in memory corruption
that is unlikely to be immediately fatal.

I wonder if we could add something to GC_DEBUG that would mark all
dead Buffers and PMCs in an easily recognizable way, and then check
all live PMCs and Buffers to be sure that nothing they point to is
marked as being dead. (Right now, they all just unconditionally mark
things as "live for this DOD pass".)

Reply via email to