Hi, I have tested your patch and I can confirm that the error is gone when running a patched kernel. Thanks for fixing this in such a short notice.
I am still seeing one very rare failure where the SIGUSR does not appear to be reported. However, I will need to dig around this a bit more to make sure there is no error on our end. Now I am thinking about how to work around these bugs, as our code will need to run on unpatched kernels as well. As for this ptrace/waitpid race, I think I will just refactor the code to make wait and ptrace calls on the same thread. This should sidestep the race, right? Regarding your bug, I am not exactly sure what are the implications. Could you briefly describe the situations in which this behavior can occur? Am I correct in understanding that this is always a race between a SIGKILL and another non-lethal signal? And that the SIGKILL will be (eventually) reported? cheers, pl On 23 March 2015 at 09:42, Pavel Labath <lab...@google.com> wrote: > On 21 March 2015 at 18:57, Oleg Nesterov <o...@redhat.com> wrote: >> On 03/20, Pavel Labath wrote: >>> >>> One difference I see though is that in >>> our test, we are not sending any additional signals to the thread in >>> question (at least we shouldn't be sending them, but we are sending some >>> signals to other threads in the same process). Do you think it could still >>> be the same issue? >> >> Not sure... >> >> And. I found another race, which looks more promising wrt your description. >> ptrace_resume() sets ->exit_code before it wakes the tracee up. If the >> tracer's sub-thread calls wait() right after that, it can wrongly see >> task_stopped_code(tracee, true) != 0, as if the tracee reports its >> ->exit_code. > Yes, that sounds very much like what is happening. :) > > >> >>> I would be happy to test your patch. I don't think I can patch the kernel >>> on my work machine directly, but I think I might be able to set up some >>> sort of a test environment to try it out. >> >> Thanks! could you try the patch below? It won't help my test-case, but >> _perhaps_ it can fix the problem you hit? > Thanks for working on this. I will get on testing this. It will take a > while though, because the failure is quite rare and usually only shows > up in a full test suite run, which takes over 5 minutes. :) > >> >> And a couple of questions just in case... >> >> Which kernel version? Although probably this doesn't matter, this race >> is very-very old. > Linux labath 3.13.0-46-generic #79-Ubuntu SMP Tue Mar 10 20:06:50 UTC > 2015 x86_64 x86_64 x86_64 GNU/Linux > It should be the kernel shipped with Ubuntu 14.04 LTS. > >> >> Let me return to your description, >> >> 1) we get a waitpid() notification that the tracee got SIGUSR1 >> 2) we do a ptrace(GETSIGINFO) to get more info >> 3) eventually we decide to restart the tracee with PTRACE_CONT, >> passing it >> SIGUSR1 >> 4) immediately after that we get another waitpid notification, again >> with >> SIGUSR1, >> >> Does this "waitpid notification" mean that _another_ thread returns >> from waitpid() ? > Yes, we have one thread doing waitpid() processing in a loop, and > another thread is issuing PTRACE requests. > >> And status == (SIGUSR1 << 8) | 0x7f , yes? IOW, is WIFSTOPPED() true? > > Exactly. The following is an excerpt from our log, which I was able to > capture. > 548 1426700116.644772000 [7295/729a]: NativeProcessLinux::Resume() > resuming thread = 29346 with signal SIGUSR1 > 554 1426700116.644889000 [7295/729c]: > MonitorChildProcessThreadFunction ::waitpid (pid = -29339, &status, > options = 1073741824)... > 555 1426700116.644896000 [7295/729c]: > MonitorChildProcessThreadFunction ::waitpid (pid = -29339, &status, > options = 1073741824) => pid = 29346, status = 0x00000a7f (STOPPED), > signal = 10, exit_state = 0 > 556 1426700116.644906000 [7295/729a]: NativeProcessLinux::Resume() > resuming thread = 29346 result = true > > The PTRACE_CONT call happens somewhere between lines 548 and 556, > while the waitpid calls happens between 554 and 555. So it is quite > possible that they are called simultaneously... > > > Thanks for working on this. > > cheers, > pl -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/