On Wed, Dec 5, 2018 at 5:01 AM Raphael Isemann via lldb-dev < lldb-dev@lists.llvm.org> wrote:
> @Jonas: Did you confirm it is SIGHUP? I remember that we were not sure > whether the signal kind was SIGHUP or SIGINT. > I'm relatively sure. I added a signal handler to lit and it fires on the signal. > > - Raphael > Am Mi., 5. Dez. 2018 um 10:25 Uhr schrieb Pavel Labath via lldb-dev > <lldb-dev@lists.llvm.org>: > > > > On 05/12/2018 03:49, Jonas Devlieghere via lldb-dev wrote: > > > Hi everyone, > > > > > > Since we switched to lit as the test driver we've been seeing it > getting killed as the result of a SIGHUP signal. The problem doesn't > reproduce on every machine and there seems to be a correlation between > number of occurrences and thread count. > > > > > > Davide and Raphael spent some time narrowing down what particular test > is causing this and it seems that TestChangeProcessGroup.py is always > involved. However it never reproduces when running just this test. I was > able to reproduce pretty consistently with the following filter: > > > > > > ./bin/llvm-lit ../llvm/tools/lldb/lit/Suite/ --filter="process" > > > > > > Bisecting the test itself didn't help much, the problem reproduces as > soon as we attach to the inferior. > > > > > > At this point it is still not clear who is sending the SIGHUP and why > it's reaching the lit test driver. Fred suggested that it might have > something to do with process groups (which would be an interesting > coincidence given the previously mentioned test) and he suggested having > the test run in different process groups. Indeed, adding a call to > os.setpgrp() in lit's executeCommand and having a different process group > per test prevent us from seeing this. Regardless of this issue I think it's > reasonable to have tests run in their process group, so if nobody objects I > propose adding this to lit in llvm. > > > > > > Still, I'd like to understand where the signal is coming from and fix > the root cause in addition to the symptom. Maybe someone here has an idea > of what might be going on? > > > > > > Thanks, > > > Jonas > > > > > > PS > > > > > > 1. There's two places where we send a SIGHUP ourself, with that code > removed we still receive the signal, which suggests that it might be coming > from Python or the OS. > > > 2. If you're able to reproduce you'll see that adding an early return > before the attach in TestChangeProcessGroup.py hides/prevents the problem. > Moving the return down one line and it pops up again. > > > _______________________________________________ > > > lldb-dev mailing list > > > lldb-dev@lists.llvm.org > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev > > > > > > > Hi Jonas, > > > > Sounds like you have found an interesting issue to debug. I've tried > > running the command you mention locally, and I didn't see any failures > > in 100 runs. Thank you. This confirm my suspicion that it's likely a Darwin-only thing. > > There doesn't seem to be anything in the TestChangeProcessGroup which > > sends a signal, though I can imagine that the act of changing a process > > group mid-debug could be enough to confuse someone to send it. However, > > I am having trouble reconciling this with your PS #2, because if > > attaching is sufficient to trigger this (i.e., no group changing takes > > place), then this test is not much different than any other test where > > we spawn an inferior and then attach to it. > I agree, I think it might be just coincidence. Also running only this test never fails so there is some timing involved. It looks like we needed at least one other process-manipulating test to make it reproduce, but this is just observation on my part and hard to confirm. > > I am aware of one other instance where we send a spurious signal, though > > it's SIGINT in this case > > < > https://github.com/llvm-mirror/lldb/blob/master/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp#L3645 > >. > > The issue there is that we don't check whether the debug server has > > exited before we send SIGINT to it (which it normally does on its own at > > the end of debug session). So if the debug server does exit and its pid > > gets recycled before we get a chance to send this signal, we can end up > > killing a random process. > I believe that posix doesn't make this guarantee, but that in reality neither linux nor darwin recycles pids before they wrap around? I don't see this signal in my DTrace output though. What I do see is that debugserver sends a SIGSTOP (13) to the inferior. I was entertaining the idea that maybe lldb crashes, debugserver notices that the socket is closed and tries to stop the inferior before killing it. We run the inferior with "shell=True" so if that somehow causes the shell to crash it might send a signal to the whole group. > > Now this may seem unrelated to your issue, but SIGHUP is sent > > automatically as a result of a process losing its controlling tty. So, > > if that SIGINT ends up killing the process holding the master end of a > > pty, this could result in some SIGHUPs being sent too. Unfortunately, > > this doesn't fully stack up either, because the process holding the > > master pty is probably a long-lived one, so its pid is unlikely to match > > one of the transient debugserver pids. Nevertheless, it could be worth > > just commenting out that line and seeing what happens. > > > > For debugging, maybe you could try installing a SIGHUP handler into the > > lit process, which would dump the received siginfo_t structure. Decoding > > that may provide some insight into who is sending that signal (si_pid) > > and why (si_code). > This doesn't seem to be available in Python 2 so I'm trying to write my own module and loading that. > > As for adding process group support into lit, I think that having each > > test run (*not* each executed command) in it's own group is reasonable. > > However, be aware that this changes the behaviour of how all signals (in > > particular the SIGINT you get when typing ^C) get delivered. AFAIK, lit > > doesn't have any special code for cleaning up the spawned processes and > > relies on the fact that ^C will send a SIGINT to the entire "foreground > > process group" and terminate stuff. If you start creating a bunch of > > process groups, you may need to add more explicit termination logic too. > Good point, being able to ^C was a big part of the reason I wanted to switch to lit in the first place so this is the last thing I want to break ;-) > > cheers, > > pl > > _______________________________________________ > > lldb-dev mailing list > > lldb-dev@lists.llvm.org > > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev > _______________________________________________ > lldb-dev mailing list > lldb-dev@lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev >
_______________________________________________ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev