On 05/12/2018 03:49, Jonas Devlieghere via lldb-dev wrote:
Hi everyone,
Since we switched to lit as the test driver we've been seeing it getting killed
as the result of a SIGHUP signal. The problem doesn't reproduce on every
machine and there seems to be a correlation between number of occurrences and
thread count.
Davide and Raphael spent some time narrowing down what particular test is
causing this and it seems that TestChangeProcessGroup.py is always involved.
However it never reproduces when running just this test. I was able to
reproduce pretty consistently with the following filter:
./bin/llvm-lit ../llvm/tools/lldb/lit/Suite/ --filter="process"
Bisecting the test itself didn't help much, the problem reproduces as soon as
we attach to the inferior.
At this point it is still not clear who is sending the SIGHUP and why it's
reaching the lit test driver. Fred suggested that it might have something to do
with process groups (which would be an interesting coincidence given the
previously mentioned test) and he suggested having the test run in different
process groups. Indeed, adding a call to os.setpgrp() in lit's executeCommand
and having a different process group per test prevent us from seeing this.
Regardless of this issue I think it's reasonable to have tests run in their
process group, so if nobody objects I propose adding this to lit in llvm.
Still, I'd like to understand where the signal is coming from and fix the root
cause in addition to the symptom. Maybe someone here has an idea of what might
be going on?
Thanks,
Jonas
PS
1. There's two places where we send a SIGHUP ourself, with that code removed we still receive the signal, which suggests that it might be coming from Python or the OS.
2. If you're able to reproduce you'll see that adding an early return before
the attach in TestChangeProcessGroup.py hides/prevents the problem. Moving the
return down one line and it pops up again.
_______________________________________________
lldb-dev mailing list
lldb-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
Hi Jonas,
Sounds like you have found an interesting issue to debug. I've tried
running the command you mention locally, and I didn't see any failures
in 100 runs.
There doesn't seem to be anything in the TestChangeProcessGroup which
sends a signal, though I can imagine that the act of changing a process
group mid-debug could be enough to confuse someone to send it. However,
I am having trouble reconciling this with your PS #2, because if
attaching is sufficient to trigger this (i.e., no group changing takes
place), then this test is not much different than any other test where
we spawn an inferior and then attach to it.
I am aware of one other instance where we send a spurious signal, though
it's SIGINT in this case
<https://github.com/llvm-mirror/lldb/blob/master/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp#L3645>.
The issue there is that we don't check whether the debug server has
exited before we send SIGINT to it (which it normally does on its own at
the end of debug session). So if the debug server does exit and its pid
gets recycled before we get a chance to send this signal, we can end up
killing a random process.
Now this may seem unrelated to your issue, but SIGHUP is sent
automatically as a result of a process losing its controlling tty. So,
if that SIGINT ends up killing the process holding the master end of a
pty, this could result in some SIGHUPs being sent too. Unfortunately,
this doesn't fully stack up either, because the process holding the
master pty is probably a long-lived one, so its pid is unlikely to match
one of the transient debugserver pids. Nevertheless, it could be worth
just commenting out that line and seeing what happens.
For debugging, maybe you could try installing a SIGHUP handler into the
lit process, which would dump the received siginfo_t structure. Decoding
that may provide some insight into who is sending that signal (si_pid)
and why (si_code).
As for adding process group support into lit, I think that having each
test run (*not* each executed command) in it's own group is reasonable.
However, be aware that this changes the behaviour of how all signals (in
particular the SIGINT you get when typing ^C) get delivered. AFAIK, lit
doesn't have any special code for cleaning up the spawned processes and
relies on the fact that ^C will send a SIGINT to the entire "foreground
process group" and terminate stuff. If you start creating a bunch of
process groups, you may need to add more explicit termination logic too.
cheers,
pl
_______________________________________________
lldb-dev mailing list
lldb-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev