Re: [lldb-dev] The lit test driver gets killed because of SIGHUP

2018-12-05 Thread Pavel Labath via lldb-dev

On 05/12/2018 03:49, Jonas Devlieghere via lldb-dev wrote:

Hi everyone,

Since we switched to lit as the test driver we've been seeing it getting killed 
as the result of a SIGHUP signal. The problem doesn't reproduce on every 
machine and there seems to be a correlation between number of occurrences and 
thread count.

Davide and Raphael spent some time narrowing down what particular test is 
causing this and it seems that TestChangeProcessGroup.py is always involved. 
However it never reproduces when running just this test. I was able to 
reproduce pretty consistently with the following filter:

./bin/llvm-lit ../llvm/tools/lldb/lit/Suite/ --filter="process"

Bisecting the test itself didn't help much, the problem reproduces as soon as 
we attach to the inferior.

At this point it is still not clear who is sending the SIGHUP and why it's 
reaching the lit test driver. Fred suggested that it might have something to do 
with process groups (which would be an interesting coincidence given the 
previously mentioned test) and he suggested having the test run in different 
process groups. Indeed, adding a call to os.setpgrp() in lit's executeCommand 
and having a different process group per test prevent us from seeing this. 
Regardless of this issue I think it's reasonable to have tests run in their 
process group, so if nobody objects I propose adding this to lit in llvm.

Still, I'd like to understand where the signal is coming from and fix the root 
cause in addition to the symptom. Maybe someone here has an idea of what might 
be going on?

Thanks,
Jonas

PS
  
1. There's two places where we send a SIGHUP ourself, with that code removed we still receive the signal, which suggests that it might be coming from Python or the OS.

2. If you're able to reproduce you'll see that adding an early return before 
the attach in TestChangeProcessGroup.py hides/prevents the problem. Moving the 
return down one line and it pops up again.
___
lldb-dev mailing list
lldb-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev



Hi Jonas,

Sounds like you have found an interesting issue to debug. I've tried 
running the command you mention locally, and I didn't see any failures 
in 100 runs.


There doesn't seem to be anything in the TestChangeProcessGroup which 
sends a signal, though I can imagine that the act of changing a process 
group mid-debug could be enough to confuse someone to send it. However, 
I am having trouble reconciling this with your PS #2, because if 
attaching is sufficient to trigger this (i.e., no group changing takes 
place), then this test is not much different than any other test where 
we spawn an inferior and then attach to it.


I am aware of one other instance where we send a spurious signal, though 
it's SIGINT in this case 
. 
The issue there is that we don't check whether the debug server has 
exited before we send SIGINT to it (which it normally does on its own at 
the end of debug session). So if the debug server does exit and its pid 
gets recycled before we get a chance to send this signal, we can end up 
killing a random process.


Now this may seem unrelated to your issue, but SIGHUP is sent 
automatically as a result of a process losing its controlling tty. So, 
if that SIGINT ends up killing the process holding the master end of a 
pty, this could result in some SIGHUPs being sent too. Unfortunately, 
this doesn't fully stack up either, because the process holding the 
master pty is probably a long-lived one, so its pid is unlikely to match 
one of the transient debugserver pids. Nevertheless, it could be worth 
just commenting out that line and seeing what happens.


For debugging, maybe you could try installing a SIGHUP handler into the 
lit process, which would dump the received siginfo_t structure. Decoding 
that may provide some insight into who is sending that signal (si_pid) 
and why (si_code).


As for adding process group support into lit, I think that having each 
test run (*not* each executed command) in it's own group is reasonable. 
However, be aware that this changes the behaviour of how all signals (in 
particular the SIGINT you get when typing ^C) get delivered. AFAIK, lit 
doesn't have any special code for cleaning up the spawned processes and 
relies on the fact that ^C will send a SIGINT to the entire "foreground 
process group" and terminate stuff. If you start creating a bunch of 
process groups, you may need to add more explicit termination logic too.


cheers,
pl
___
lldb-dev mailing list
lldb-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev


Re: [lldb-dev] The lit test driver gets killed because of SIGHUP

2018-12-05 Thread Raphael Isemann via lldb-dev
@Jonas: Did you confirm it is SIGHUP? I remember that we were not sure
whether the signal kind was SIGHUP or SIGINT.

- Raphael
Am Mi., 5. Dez. 2018 um 10:25 Uhr schrieb Pavel Labath via lldb-dev
:
>
> On 05/12/2018 03:49, Jonas Devlieghere via lldb-dev wrote:
> > Hi everyone,
> >
> > Since we switched to lit as the test driver we've been seeing it getting 
> > killed as the result of a SIGHUP signal. The problem doesn't reproduce on 
> > every machine and there seems to be a correlation between number of 
> > occurrences and thread count.
> >
> > Davide and Raphael spent some time narrowing down what particular test is 
> > causing this and it seems that TestChangeProcessGroup.py is always 
> > involved. However it never reproduces when running just this test. I was 
> > able to reproduce pretty consistently with the following filter:
> >
> > ./bin/llvm-lit ../llvm/tools/lldb/lit/Suite/ --filter="process"
> >
> > Bisecting the test itself didn't help much, the problem reproduces as soon 
> > as we attach to the inferior.
> >
> > At this point it is still not clear who is sending the SIGHUP and why it's 
> > reaching the lit test driver. Fred suggested that it might have something 
> > to do with process groups (which would be an interesting coincidence given 
> > the previously mentioned test) and he suggested having the test run in 
> > different process groups. Indeed, adding a call to os.setpgrp() in lit's 
> > executeCommand and having a different process group per test prevent us 
> > from seeing this. Regardless of this issue I think it's reasonable to have 
> > tests run in their process group, so if nobody objects I propose adding 
> > this to lit in llvm.
> >
> > Still, I'd like to understand where the signal is coming from and fix the 
> > root cause in addition to the symptom. Maybe someone here has an idea of 
> > what might be going on?
> >
> > Thanks,
> > Jonas
> >
> > PS
> >
> > 1. There's two places where we send a SIGHUP ourself, with that code 
> > removed we still receive the signal, which suggests that it might be coming 
> > from Python or the OS.
> > 2. If you're able to reproduce you'll see that adding an early return 
> > before the attach in TestChangeProcessGroup.py hides/prevents the problem. 
> > Moving the return down one line and it pops up again.
> > ___
> > lldb-dev mailing list
> > lldb-dev@lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
> >
>
> Hi Jonas,
>
> Sounds like you have found an interesting issue to debug. I've tried
> running the command you mention locally, and I didn't see any failures
> in 100 runs.
>
> There doesn't seem to be anything in the TestChangeProcessGroup which
> sends a signal, though I can imagine that the act of changing a process
> group mid-debug could be enough to confuse someone to send it. However,
> I am having trouble reconciling this with your PS #2, because if
> attaching is sufficient to trigger this (i.e., no group changing takes
> place), then this test is not much different than any other test where
> we spawn an inferior and then attach to it.
>
> I am aware of one other instance where we send a spurious signal, though
> it's SIGINT in this case
> .
> The issue there is that we don't check whether the debug server has
> exited before we send SIGINT to it (which it normally does on its own at
> the end of debug session). So if the debug server does exit and its pid
> gets recycled before we get a chance to send this signal, we can end up
> killing a random process.
>
> Now this may seem unrelated to your issue, but SIGHUP is sent
> automatically as a result of a process losing its controlling tty. So,
> if that SIGINT ends up killing the process holding the master end of a
> pty, this could result in some SIGHUPs being sent too. Unfortunately,
> this doesn't fully stack up either, because the process holding the
> master pty is probably a long-lived one, so its pid is unlikely to match
> one of the transient debugserver pids. Nevertheless, it could be worth
> just commenting out that line and seeing what happens.
>
> For debugging, maybe you could try installing a SIGHUP handler into the
> lit process, which would dump the received siginfo_t structure. Decoding
> that may provide some insight into who is sending that signal (si_pid)
> and why (si_code).
>
> As for adding process group support into lit, I think that having each
> test run (*not* each executed command) in it's own group is reasonable.
> However, be aware that this changes the behaviour of how all signals (in
> particular the SIGINT you get when typing ^C) get delivered. AFAIK, lit
> doesn't have any special code for cleaning up the spawned processes and
> relies on the fact that ^C will send a SIGINT to the entire "foreground
> process group" and terminate stuff. If you start creating a bun

Re: [lldb-dev] The lit test driver gets killed because of SIGHUP

2018-12-05 Thread Jonas Devlieghere via lldb-dev
On Wed, Dec 5, 2018 at 5:01 AM Raphael Isemann via lldb-dev <
lldb-dev@lists.llvm.org> wrote:

> @Jonas: Did you confirm it is SIGHUP? I remember that we were not sure
> whether the signal kind was SIGHUP or SIGINT.
>

I'm relatively sure. I added a signal handler to lit and it fires on the
signal.


>
> - Raphael
> Am Mi., 5. Dez. 2018 um 10:25 Uhr schrieb Pavel Labath via lldb-dev
> :
> >
> > On 05/12/2018 03:49, Jonas Devlieghere via lldb-dev wrote:
> > > Hi everyone,
> > >
> > > Since we switched to lit as the test driver we've been seeing it
> getting killed as the result of a SIGHUP signal. The problem doesn't
> reproduce on every machine and there seems to be a correlation between
> number of occurrences and thread count.
> > >
> > > Davide and Raphael spent some time narrowing down what particular test
> is causing this and it seems that TestChangeProcessGroup.py is always
> involved. However it never reproduces when running just this test. I was
> able to reproduce pretty consistently with the following filter:
> > >
> > > ./bin/llvm-lit ../llvm/tools/lldb/lit/Suite/ --filter="process"
> > >
> > > Bisecting the test itself didn't help much, the problem reproduces as
> soon as we attach to the inferior.
> > >
> > > At this point it is still not clear who is sending the SIGHUP and why
> it's reaching the lit test driver. Fred suggested that it might have
> something to do with process groups (which would be an interesting
> coincidence given the previously mentioned test) and he suggested having
> the test run in different process groups. Indeed, adding a call to
> os.setpgrp() in lit's executeCommand and having a different process group
> per test prevent us from seeing this. Regardless of this issue I think it's
> reasonable to have tests run in their process group, so if nobody objects I
> propose adding this to lit in llvm.
> > >
> > > Still, I'd like to understand where the signal is coming from and fix
> the root cause in addition to the symptom. Maybe someone here has an idea
> of what might be going on?
> > >
> > > Thanks,
> > > Jonas
> > >
> > > PS
> > >
> > > 1. There's two places where we send a SIGHUP ourself, with that code
> removed we still receive the signal, which suggests that it might be coming
> from Python or the OS.
> > > 2. If you're able to reproduce you'll see that adding an early return
> before the attach in TestChangeProcessGroup.py hides/prevents the problem.
> Moving the return down one line and it pops up again.
> > > ___
> > > lldb-dev mailing list
> > > lldb-dev@lists.llvm.org
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
> > >
> >
> > Hi Jonas,
> >
> > Sounds like you have found an interesting issue to debug. I've tried
> > running the command you mention locally, and I didn't see any failures
> > in 100 runs.


Thank you. This confirm my suspicion that it's likely a Darwin-only thing.


> > There doesn't seem to be anything in the TestChangeProcessGroup which
> > sends a signal, though I can imagine that the act of changing a process
> > group mid-debug could be enough to confuse someone to send it. However,
> > I am having trouble reconciling this with your PS #2, because if
> > attaching is sufficient to trigger this (i.e., no group changing takes
> > place), then this test is not much different than any other test where
> > we spawn an inferior and then attach to it.
>

I agree, I think it might be just coincidence. Also running only this test
never fails so there is some timing involved. It looks like we needed at
least one other process-manipulating test to make it reproduce, but this is
just observation on my part and hard to confirm.


> > I am aware of one other instance where we send a spurious signal, though
> > it's SIGINT in this case
> > <
> https://github.com/llvm-mirror/lldb/blob/master/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp#L3645
> >.
> > The issue there is that we don't check whether the debug server has
> > exited before we send SIGINT to it (which it normally does on its own at
> > the end of debug session). So if the debug server does exit and its pid
> > gets recycled before we get a chance to send this signal, we can end up
> > killing a random process.
>

I believe that posix doesn't make this guarantee, but that in reality
neither linux nor darwin recycles pids before they wrap around? I don't see
this signal in my DTrace output though. What I do see is that debugserver
sends a SIGSTOP (13) to the inferior. I was entertaining the idea that
maybe lldb crashes, debugserver notices that the socket is closed and tries
to stop the inferior before killing it. We run the inferior with
"shell=True" so if that somehow causes the shell to crash it might send a
signal to the whole group.


> > Now this may seem unrelated to your issue, but SIGHUP is sent
> > automatically as a result of a process losing its controlling tty. So,
> > if that SIGINT ends up killing the process hol

Re: [lldb-dev] The lit test driver gets killed because of SIGHUP

2018-12-05 Thread Pavel Labath via lldb-dev

On 05/12/2018 18:36, Jonas Devlieghere wrote:
I believe that posix doesn't make this guarantee, but that in reality 
neither linux nor darwin recycles pids before they wrap around?


Yes, linux tries pretty hard to not recycle pids, but this is hampered 
by the fact that the default pid limit  is 32k and that process and 
thread ids share the same namespace.


So given that we have over 1k tests and each test can easily spawn over 
32 tids/pids on a 32-core machine (parallel parsing of debug info), we 
can easily run through the whole pid pool in a single test run.

___
lldb-dev mailing list
lldb-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev


Re: [lldb-dev] The lit test driver gets killed because of SIGHUP

2018-12-05 Thread Jonas Devlieghere via lldb-dev
On Wed, Dec 5, 2018 at 9:45 AM Pavel Labath  wrote:

> On 05/12/2018 18:36, Jonas Devlieghere wrote:
> > I believe that posix doesn't make this guarantee, but that in reality
> > neither linux nor darwin recycles pids before they wrap around?
>
> Yes, linux tries pretty hard to not recycle pids, but this is hampered
> by the fact that the default pid limit  is 32k and that process and
> thread ids share the same namespace.
>
> So given that we have over 1k tests and each test can easily spawn over
> 32 tids/pids on a 32-core machine (parallel parsing of debug info), we
> can easily run through the whole pid pool in a single test run.
>

Yeah, that makes sense. What I meant is that I don't think it's what's
happening here as we're only running a few tests and I can see the pids are
moslty consecutive.

As I mentioned in the previous e-mail I wrote a Python module in C to check
who's sending the PID. It turns out it's the inferior, the one that's
receiving signal 17 (SIGSTOP) from the debugserver. This actually makes
sense as SIGSTOP is involved in process control. I'd have to double check,
but it would definitely make sense if this signal is sent to the whole
foreground process group, similar to ^C and it would explain why changing
that prevents this from happening.

I'm going to see what happens if I change the SIGSTOP into a SIGINT.
___
lldb-dev mailing list
lldb-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev