Re: [lldb-dev] The lit test driver gets killed because of SIGHUP

Jonas Devlieghere via lldb-dev Wed, 05 Dec 2018 09:37:22 -0800

On Wed, Dec 5, 2018 at 5:01 AM Raphael Isemann via lldb-dev <
[email protected]> wrote:


> @Jonas: Did you confirm it is SIGHUP? I remember that we were not sure
> whether the signal kind was SIGHUP or SIGINT.
>

I'm relatively sure. I added a signal handler to lit and it fires on the
signal.


>
> - Raphael
> Am Mi., 5. Dez. 2018 um 10:25 Uhr schrieb Pavel Labath via lldb-dev
> <[email protected]>:
> >
> > On 05/12/2018 03:49, Jonas Devlieghere via lldb-dev wrote:
> > > Hi everyone,
> > >
> > > Since we switched to lit as the test driver we've been seeing it
> getting killed as the result of a SIGHUP signal. The problem doesn't
> reproduce on every machine and there seems to be a correlation between
> number of occurrences and thread count.
> > >
> > > Davide and Raphael spent some time narrowing down what particular test
> is causing this and it seems that TestChangeProcessGroup.py is always
> involved. However it never reproduces when running just this test. I was
> able to reproduce pretty consistently with the following filter:
> > >
> > > ./bin/llvm-lit ../llvm/tools/lldb/lit/Suite/ --filter="process"
> > >
> > > Bisecting the test itself didn't help much, the problem reproduces as
> soon as we attach to the inferior.
> > >
> > > At this point it is still not clear who is sending the SIGHUP and why
> it's reaching the lit test driver. Fred suggested that it might have
> something to do with process groups (which would be an interesting
> coincidence given the previously mentioned test) and he suggested having
> the test run in different process groups. Indeed, adding a call to
> os.setpgrp() in lit's executeCommand and having a different process group
> per test prevent us from seeing this. Regardless of this issue I think it's
> reasonable to have tests run in their process group, so if nobody objects I
> propose adding this to lit in llvm.
> > >
> > > Still, I'd like to understand where the signal is coming from and fix
> the root cause in addition to the symptom. Maybe someone here has an idea
> of what might be going on?
> > >
> > > Thanks,
> > > Jonas
> > >
> > > PS
> > >
> > > 1. There's two places where we send a SIGHUP ourself, with that code
> removed we still receive the signal, which suggests that it might be coming
> from Python or the OS.
> > > 2. If you're able to reproduce you'll see that adding an early return
> before the attach in TestChangeProcessGroup.py hides/prevents the problem.
> Moving the return down one line and it pops up again.
> > > _______________________________________________
> > > lldb-dev mailing list
> > > [email protected]
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
> > >
> >
> > Hi Jonas,
> >
> > Sounds like you have found an interesting issue to debug. I've tried
> > running the command you mention locally, and I didn't see any failures
> > in 100 runs.


Thank you. This confirm my suspicion that it's likely a Darwin-only thing.


> > There doesn't seem to be anything in the TestChangeProcessGroup which
> > sends a signal, though I can imagine that the act of changing a process
> > group mid-debug could be enough to confuse someone to send it. However,
> > I am having trouble reconciling this with your PS #2, because if
> > attaching is sufficient to trigger this (i.e., no group changing takes
> > place), then this test is not much different than any other test where
> > we spawn an inferior and then attach to it.
>

I agree, I think it might be just coincidence. Also running only this test
never fails so there is some timing involved. It looks like we needed at
least one other process-manipulating test to make it reproduce, but this is
just observation on my part and hard to confirm.


> > I am aware of one other instance where we send a spurious signal, though
> > it's SIGINT in this case
> > <
> https://github.com/llvm-mirror/lldb/blob/master/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp#L3645
> >.
> > The issue there is that we don't check whether the debug server has
> > exited before we send SIGINT to it (which it normally does on its own at
> > the end of debug session). So if the debug server does exit and its pid
> > gets recycled before we get a chance to send this signal, we can end up
> > killing a random process.
>

I believe that posix doesn't make this guarantee, but that in reality
neither linux nor darwin recycles pids before they wrap around? I don't see
this signal in my DTrace output though. What I do see is that debugserver
sends a SIGSTOP (13) to the inferior. I was entertaining the idea that
maybe lldb crashes, debugserver notices that the socket is closed and tries
to stop the inferior before killing it. We run the inferior with
"shell=True" so if that somehow causes the shell to crash it might send a
signal to the whole group.


> > Now this may seem unrelated to your issue, but SIGHUP is sent
> > automatically as a result of a process losing its controlling tty. So,
> > if that SIGINT ends up killing the process holding the master end of a
> > pty, this could result in some SIGHUPs being sent too. Unfortunately,
> > this doesn't fully stack up either, because the process holding the
> > master pty is probably a long-lived one, so its pid is unlikely to match
> > one of the transient debugserver pids. Nevertheless, it could be worth
> > just commenting out that line and seeing what happens.
> >
> > For debugging, maybe you could try installing a SIGHUP handler into the
> > lit process, which would dump the received siginfo_t structure. Decoding
> > that may provide some insight into who is sending that signal (si_pid)
> > and why (si_code).
>

This doesn't seem to be available in Python 2 so I'm trying to write my own
module and loading that.


> > As for adding process group support into lit, I think that having each
> > test run (*not* each executed command) in it's own group is reasonable.
> > However, be aware that this changes the behaviour of how all signals (in
> > particular the SIGINT you get when typing ^C) get delivered. AFAIK, lit
> > doesn't have any special code for cleaning up the spawned processes and
> > relies on the fact that ^C will send a SIGINT to the entire "foreground
> > process group" and terminate stuff. If you start creating a bunch of
> > process groups, you may need to add more explicit termination logic too.
>

Good point, being able to ^C was a big part of the reason I wanted to
switch to lit in the first place so this is the last thing I want to break
;-)


> > cheers,
> > pl
> > _______________________________________________
> > lldb-dev mailing list
> > [email protected]
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
> _______________________________________________
> lldb-dev mailing list
> [email protected]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
>

_______________________________________________
lldb-dev mailing list
[email protected]
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev

Re: [lldb-dev] The lit test driver gets killed because of SIGHUP

Reply via email to