Sending again, this time as plain text (I hope)...
On 20 March 2015 at 18:46, Pavel Labath <lab...@google.com> wrote:
>
> Hi,
>
> thanks for the super quick response. :)
>
> I am at home now, so I don't have access to the same machine to run the test.
> I will run it on Monday and let you know.
>
> Meanwhile, I have tried running your test on my home machine, and it is
> indeed reporting "unexpected wait: stat=57f". If I understand correctly, that
> means the wait has reported SIGTRAP even though the tracee was in ptrace-stop.
>
> I can imagine that something similar is happening in our case. Since
> PTRACE_CONT and waitpid calls are happening in different threads, I can't
> positively say which one occurred first. So far I have assumed the
> sequence was PTRACE_CONT -> waitpid -> PTRACE_GETSIGINFO. However, if wait can
> return even though the process is stopped, then a possible sequence of events
> is waitpid -> PTRACE_CONT -> PTRACE_GETSIGINFO, in which case it is not
> surprising that the last call fails. One difference I see though is that in
> our test, we are not sending any additional signals to the thread in question
> (at least we shouldn't be sending them, but we are sending some signals to
> other threads in the same process). Do you think it could still be the same
> issue?
>
> I would be happy to test your patch. I don't think I can patch the kernel on
> my work machine directly, but I think I might be able to set up some sort of
> a test environment to try it out.
>
> regards,
> pavel
>
>
> On 20 March 2015 at 16:25, Oleg Nesterov <o...@redhat.com> wrote:
>>
>> Hi Pavel,
>>
>> let me add lkml, we should not discuss this offlist.
>>
>> On 03/20, Pavel Labath wrote:
>> >
>> > 1) we get a waitpid() notification that the tracee got SIGUSR1
>> > 2) we do a ptrace(GETSIGINFO) to get more info
>> > 3) eventually we decide to restart the tracee with PTRACE_CONT, passing it
>> > SIGUSR1
>> > 4) immediately after that we get another waitpid notification, again with
>> > SIGUSR1, even though the thread had received no additional signals
>> > 5) we again try to do a GETSIGINFO, however this time it fails with ESRCH.
>> > Therefore, we assume that the thread has died
>>
>> I found a similar bug by code inspection some time ago. I even have
>> a fix, but I need to think more... And I even wrote the test-case ;)
>> see below.
>>
>> But so far I can't say if you hit the same problem or not. If you can
>> reproduce the problem, perhaps I can send you a debugging patch?
>>
>> Oleg.
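
For reference, the value in "unexpected wait: stat=57f" can be decoded with the standard wait status macros. What follows is only an illustrative sketch and is not the test-case referred to in the quoted mail; describe_wait_status is a hypothetical helper name, and the signal number shown in the usage note assumes Linux, where SIGTRAP is 5.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not from the thread): write a short description of a
 * status value, as returned by waitpid(), into buf.
 * Returns the result of snprintf().
 */
static int describe_wait_status(int status, char *buf, size_t len)
{
    /* Low byte 0x7f marks a stop; the next byte holds the stop signal. */
    if (WIFSTOPPED(status))
        return snprintf(buf, len, "stopped by signal %d", WSTOPSIG(status));
    if (WIFSIGNALED(status))
        return snprintf(buf, len, "killed by signal %d", WTERMSIG(status));
    if (WIFEXITED(status))
        return snprintf(buf, len, "exited with code %d", WEXITSTATUS(status));
    return snprintf(buf, len, "unrecognized status 0x%x", status);
}
```

Calling describe_wait_status(0x57f, buf, sizeof buf) should fill buf with "stopped by signal 5" on Linux, i.e. a stop reported with SIGTRAP, which matches the reading of stat=57f given in the mail.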