Re: Now crashing [was Re: guile-2.9.2 and threading

Linas Vepstas Sun, 21 Jul 2019 14:12:12 -0700

How utterly embarrassing.  Please ignore most of this verbose and difficult
email chain. Yes, guile-2.9.2 is still crashing, but almost all of my
analysis was wrong. Turns out that my scheme code was calling `(10)` i.e.
taking an integer, and treating it as a function, and attempting to call
it. So the call to `scm_error` was exactly right. It was invisible to me
because ... it was ignored in my code.


However -- if one does call `scm_error` fairly rapidly, from multiple
threads, one will eventually hit a race condition and get a crash.  I'm not
sure how to create a mini-test-case for this within guile; my code is
creating threads outside of guile, and launching `scm_eval` in each (and
ignoring the resulting error).  This was leading to a crash after 5-10
minutes.

-- Linas

On Wed, Jul 17, 2019 at 10:52 PM Linas Vepstas <[email protected]>
wrote:

> Oh, I get it.  I think the bug is this:  VM_DEFINE_OP (7,
> return_values,...)
> finds some mcode, and calls it.  What it found was the
> emit_get_callee_vcode
> but it is totally pointless to call this mcode, since we're returning, and
> not
> calling. So its just not useful.
>
> Worse, it gets called with garbage values, which are then silenced by
> ignoring
> the resulting  scm_error, and everything appears to run smoothly ... for a
> while.
> Until some later time, (millions of calls later), when there is a
> completely unrelated
> race condition that causes the scm_error to get tangled and die.  The
> ideal
> solution would be simply to not call the mcode for get_callee; that would
> save
> time and trouble.
>
> That's my hypothesis. I tried to test a mock-up of this solution with the
> patch
> below, but it is too simplistic t actually work (null pointer-deref.)  I
> con't find
> a beter solution
>
> If you've got a better idea, let me know...
>
> -- Linas
>
> --- a/libguile/vm-engine.c
> +++ b/libguile/vm-engine.c
> @@ -553,6 +553,7 @@ VM_NAME (scm_thread *thread)
>            mcode = SCM_FRAME_MACHINE_RETURN_ADDRESS (old_fp);
>            if (mcode && mcode != scm_jit_return_to_interpreter_trampoline)
>              {
> +              VP->unused = 1;
>                scm_jit_enter_mcode (thread, mcode);
>                CACHE_REGISTER ();
>                NEXT (0);
> diff --git a/libguile/vm.c b/libguile/vm.c
> index d7b1788..8e178c7 100644
> --- a/libguile/vm.c
> +++ b/libguile/vm.c
> @@ -620,6 +620,7 @@ scm_i_vm_prepare_stack (struct scm_vm *vp)
>    vp->compare_result = SCM_F_COMPARE_NONE;
>    vp->engine = vm_default_engine;
>    vp->trace_level = 0;
> +  vp->unused = 0;
>  #define INIT_HOOK(h) vp->h##_hook = SCM_BOOL_F;
>    FOR_EACH_HOOK (INIT_HOOK)
>  #undef INIT_HOOK
> @@ -1515,6 +1516,7 @@ get_callee_vcode (scm_thread *thread)
>
>    vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);
>
> +  if (vp->unused) { vp->unused = 0; return 0; }
>    scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
>               scm_list_1 (proc), scm_list_1 (proc));
>  }
>
> On Wed, Jul 17, 2019 at 8:42 PM Linas Vepstas <[email protected]>
> wrote:
>
>> Seem to be narrowing it down ... or at least, I have more details ...
>>
>> On Wed, Jul 17, 2019 at 4:44 PM Linas Vepstas <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Wed, Jul 17, 2019 at 12:49 PM Mark H Weaver <[email protected]> wrote:
>>>
>>>> Hi Linas,
>>>>
>>>> > Investigating the crash with good-old printf's in libguile/vm.c
>>>> produces
>>>> > a vast ocean of prints ... that should have not been printed, and/or
>>>> should
>>>> > have been actual errors, but somehow were not handled by scm_error.
>>>> > Using today's git pull of master, here's the diff containing a printf:
>>>> >
>>>> > --- a/libguile/vm.c
>>>> > +++ b/libguile/vm.c
>>>> > @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); }
>>>> >
>>>> >        proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline;
>>>> >        SCM_FRAME_LOCAL (vp->fp, 0) = proc;
>>>> >        return SCM_PROGRAM_CODE (proc);
>>>> >      }
>>>> >
>>>> > +printf("duuude wrong type to apply!\n"
>>>> > +"proc=%lx\n"
>>>> > +"ip=%p\n"
>>>> > +"sp=%p\n"
>>>> > +"fp=%p\n"
>>>> > +"sp_min=%p\n"
>>>> > +"stack_lim=%p\n",
>>>> > +SCM_FRAME_SLOT(vp->fp, 0)->as_u64,
>>>> > +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit);
>>>> > +fflush(stdout);
>>>> > +
>>>> >    vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);
>>>> >
>>>> >    scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
>>>> >               scm_list_1 (proc), scm_list_1 (proc));
>>>> >  }
>>>> >
>>>> > As you can see, shortly after my printf, there should have been an
>>>> > error report.
>>>>
>>>> Not necessarily.  Note that what 'scm_error' actually does is to raise
>>>> an exception.  What happens next depends on what exception handlers are
>>>> installed at the time of the error.
>>>>
>>>
>>> OK, but... when I look at what get_callee_vcode() actually does, it seems
>>> to be earnestly trying to fish out the location of a callable function
>>> from the
>>> frame pointer, and it does so three plausible ways. If those three don't
>>> work
>>> out, then it sets the instruction pointer (to the garbage value),
>>> followed by
>>> scm_error(Wrong type to apply). This also looks like an earnest, honest
>>> attempt to report a real error.  But lets double-check.
>>>
>>> So who calls get_callee_vcode(), and why, and what did they expect to
>>> happen?
>>> Well, that's in three places: one in scm_call_n which is a plausible
>>> place where
>>> one might expect the instruction pointer to be set to a valid value.
>>> Then there's two
>>> places in vm-engine.c -- "tail-call" and "call" both of which one might
>>> plausibly expect
>>> to have a valid instruction pointer.  I can't imagine any valid scenario
>>> where anyone
>>> was expecting get_callee_vcode() to actually fail in the normal course
>>> of operations.
>>>
>>
>> There is one more place where  get_callee_vcode() can get called -- via
>> the jump_table,
>> via a call to scm_jit_enter_mcode()  which issues the code emitted by
>> emit_get_callee_vcode
>>
>> There are four calls to scm_jit_enter_mcode()  The one that immediately
>> preceeds
>> the bug is always the one made here, in vm-engine.c:
>> VM_DEFINE_OP (7, return_values, "return-values", OP1 (X32))
>>
>> Right before the call to scm_jit_enter_mcode(), I can printf VP->fp and
>> SCM_FRAME_LOCAL(VP->fp, 0),
>> and they are... fp=0x7fffe000caf8 fpslot=d33b00 (typical)
>>
>> the mcode is of course some bytecode that bounces through lightning, and
>> a few insns
>> later, it arrives at get_callee_vcode() but now  the fp is different, (it
>> changes by 0x20,
>> always) and the slot is different:  fp=0x7fffe000cad8  and
>> SCM_FRAME_LOCAL(fp,0)
>> is 0x32 and the 0x32 triggers the scm_error(). (because 0x32 is not any
>> of
>> SCM_PROGRAM_P or SCM_STRUCTP or a smob)
>>
>> (but also, the fpslot=d33b00 is never a SCM_PROGRAM_P or SCM_STRUCTP or
>> a smob, either... so something got computed along the way ... )
>>
>> That's what I've got so far. Its highly reproducible.  Quick to happen.
>> I'm not sure
>> what to do next. I guess I need to examine emit_get_callee_vcode() and
>> see what
>> it does, and why.   Any comments, suggestions would be useful.
>>
>> -- Linas
>>
>>
>>> That is, I can't think of any valid reason why anyone would want to
>>> suppress
>>> the scm_error().  And even if I could -- calling scm_error() hundreds of
>>> times
>>> per second, as fast as possible, does not seem like efficient coding for
>>> dealing
>>> with a call to an invalid address.
>>>
>>> Anyway I'm trying to track down where the invalid value gets set. No
>>> luck so far.
>>> There are 6 or 8 places in vm-engine.c where the frame pointer is set to
>>> something
>>> that isn't a pointer (which seems like cheating to me: passing
>>> non-pointer values
>>> in something called "pointer" is .. well, knee jerk reaction is that
>>> it's not wise, but
>>> there may be a deeper reason.)
>>>
>>>
>>>>
>>>> > There is no error report... until 5-10 minutes later, when the error
>>>> > report itself causes a crash.  Before then, I get an endless
>>>> > high-speed spew of prints:
>>>>
>>>> It looks like another error is happening within the exception handler.
>>>>
>>>
>>> Well, yes, that also. But given that the instruction pointer contains
>>> garbage
>>> its perhaps not entirely surprising... at best, the question is, why
>>> didn't it fail
>>> sooner?
>>>
>>> -- Linas
>>>
>>>>
>>>>        Mark
>>>>
>>>> PS: It would be good to pick either 'guile-devel' or 'guile-user' for
>>>>     continuation of this thread.  I don't see a reason why it should be
>>>>     sent to both lists.
>>>>
>>>
>>>
>>> --
>>> cassette tapes - analog TV - film cameras - you
>>>
>>
>>
>> --
>> cassette tapes - analog TV - film cameras - you
>>
>
>
> --
> cassette tapes - analog TV - film cameras - you
>


-- 
cassette tapes - analog TV - film cameras - you

Re: Now crashing [was Re: guile-2.9.2 and threading

Reply via email to