Sending again, this time as plain text (I hope)...

On 20 March 2015 at 18:46, Pavel Labath <lab...@google.com> wrote:
>
> Hi,
>
> thanks for the super quick response. :)
>
> I am at home now, so I don't have access to the same machine to run the test. 
> I will run it on Monday and let you know.
>
> Meanwhile, I have tried running your test on my home machine, and it is 
> indeed reporting "unexpected wait: stat=57f". If I understand correctly, that 
> means the wait has reported SIGTRAP even though the tracee was in ptrace-stop.
>
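
For reference, the status value above can be decoded with the standard
<sys/wait.h> macros. This is only a minimal sketch, assuming Linux and glibc;
the helper names stop_signal() and ptrace_event() are made up for illustration
and are not part of the test case:

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical helper: returns the stop signal encoded in a waitpid()
 * status, or -1 if the status does not describe a stopped child. */
int stop_signal(int status)
{
    if (!WIFSTOPPED(status))
        return -1;
    return WSTOPSIG(status);
}

/* Hypothetical helper: returns the PTRACE_EVENT_* code that Linux stores
 * in bits 16 and up of the status for event stops, or 0 if none is set. */
int ptrace_event(int status)
{
    return (status >> 16) & 0xffff;
}
```

With stat=0x57f, the low byte 0x7f marks a stop and the next byte 0x05 is the
signal, so stop_signal() gives SIGTRAP, and ptrace_event() gives 0 because no
PTRACE_EVENT_* code is encoded in the upper bits.
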
> I can imagine that something similar is happening in our case. Since the
> PTRACE_CONT and waitpid calls happen in different threads, I can't say for
> certain which one occurred first. So far I have assumed the sequence was
> PTRACE_CONT -> waitpid -> PTRACE_GETSIGINFO. However, if wait can return even
> though the process is stopped, then a possible sequence of events is
> waitpid -> PTRACE_CONT -> PTRACE_GETSIGINFO, in which case it is not
> surprising that the last call fails. One difference I see, though, is that in
> our test we are not sending any additional signals to the thread in question
> (at least we shouldn't be, although we are sending some signals to other
> threads in the same process). Do you think it could still be the same issue?
>
> I would be happy to test your patch. I don't think I can patch the kernel on 
> my work machine directly, but I think I might be able to set up some sort of
> test environment to try it out.
>
> regards,
> pavel
>
>
> On 20 March 2015 at 16:25, Oleg Nesterov <o...@redhat.com> wrote:
>>
>> Hi Pavel,
>>
>> let me add lkml, we should not discuss this offlist.
>>
>> On 03/20, Pavel Labath wrote:
>> >
>> > 1) we get a waitpid() notification that the tracee got SIGUSR1
>> > 2) we do a ptrace(GETSIGINFO) to get more info
>> > 3) eventually we decide to restart the tracee with PTRACE_CONT, passing it
>> > SIGUSR1
>> > 4) immediately after that we get another waitpid notification, again with
>> > SIGUSR1, even though the thread had received no additional signals
>> > 5) we again try to do a GETSIGINFO, however this time it fails with ESRCH.
>> > Therefore, we assume that the thread has died
>>
>> I found a similar bug by code inspection some time ago. I even have
>> a fix, but I need to think more... And I even wrote the test-case ;)
>> see below.
>>
>> But so far I can't say if you hit the same problem or not. If you can
>> reproduce the problem, perhaps I can send you a debugging patch?
>>
>> Oleg.
