On 3/9/2017 3:57 PM, Bryan Drewery wrote:
> On 3/9/2017 3:47 PM, Bryan Drewery wrote:
>> On 3/9/2017 3:11 PM, Jilles Tjoelker wrote:
>>> On Thu, Mar 09, 2017 at 04:46:46PM +0200, Konstantin Belousov wrote:
>>>> Yes, there is a race, apparently, with the child zombie still not finishing
>>>> sending the SIGCHLD to the parent and parent exiting. The following should
>>>> fix the issue, but I do not think that reproducing the problem is easy.
>>>
>>>> diff --git a/sys/kern/kern_exit.c b/sys/kern/kern_exit.c
>>>> index c524fe5df37..ba5ff84e9de 100644
>>>> --- a/sys/kern/kern_exit.c
>>>> +++ b/sys/kern/kern_exit.c
>>>> @@ -189,6 +189,7 @@ exit1(struct thread *td, int rval, int signo)
>>>> {
>>>> struct proc *p, *nq, *q, *t;
>>>> struct thread *tdt;
>>>> + ksiginfo_t ksi;
>>>>
>>>> mtx_assert(&Giant, MA_NOTOWNED);
>>>> KASSERT(rval == 0 || signo == 0, ("exit1 rv %d sig %d", rval, signo));
>>>> @@ -456,7 +457,12 @@ exit1(struct thread *td, int rval, int signo)
>>>> proc_reparent(q, q->p_reaper);
>>>> if (q->p_state == PRS_ZOMBIE) {
>>>> PROC_LOCK(q->p_reaper);
>>>> - pksignal(q->p_reaper, SIGCHLD, q->p_ksi);
>>>> + if (q->p_ksi != NULL) {
>>>> + ksiginfo_init(&ksi);
>>>> + ksiginfo_copy(q->p_ksi, &ksi);
>>>> + }
>>>> + pksignal(q->p_reaper, SIGCHLD, q->p_ksi !=
>>>> + NULL ? &ksi : NULL);
>>>> PROC_UNLOCK(q->p_reaper);
>>>> }
>>>> } else {
>>
>> I just got something weird with this patch that wasn't happening before:
>>
>> /usr/bin/time -l src/bin/poudriere -e /usr/local/etc testport -j
>> exp-10amd64 -p commit -z test devel/ccache
>> [poudriere runs and completes with exit status 0]
>>> time: command terminated abnormally
>>>
>>> 28.08 real 9.92 user 10.38 sys
>>>
>>> 23464 maximum resident set size
>>>
>>> 4996 average shared memory size
>>>
>>> 88 average unshared data size
>>>
>>> 127 average unshared stack size
>>>
>>> 282705 page reclaims
>>>
>>> 5623 page faults
>>>
>>> 0 swaps
>>>
>>> 2673 block input operations
>>>
>>> 4836 block output operations
>>>
>>> 33 messages sent
>>>
>>> 0 messages received
>>>
>>> 37 signals received
>>>
>>> 11226 voluntary context switches
>>>
>>> 780 involuntary context switches
>>>
>>> zsh: alarm /usr/bin/time -l src/bin/poudriere -e /usr/local/etc
>>> testport -j exp-10amd64
>> exit status: 142 (SIGALRM).
>>
>> I don't see time(1) using SIGALRM or proc reaper at all.
>>
>> Rerunning it, and trying other simpler test cases, does not produce the
>> same result. It may be some race unrelated to this patch, dunno.
>>
>
> I'm consistently getting foreground processes getting the wrong signals
> now. I'm removing this patch for now.It wasn't this patch doing this. Something else is very wrong with signal handling right now. > >> >>> >>> This patch introduces a subtle correctness bug. A real SIGCHLD ksiginfo >>> should always be the zombie's p_ksi; otherwise, the siginfo may be lost >>> if there are too many signals pending for the target process or in the >>> system. If the siginfo is lost and the reaper normally passes si_pid to >>> waitpid() or similar (instead of passing WAIT_ANY or P_ALL), a zombie >>> will remain until the reaper terminates. >>> >>> Conceptually the siginfo is sent to one process at a time only, so the >>> bug is an artifact of the implementation. Perhaps the piece of code >>> added in r309886 can be moved or the ksiginfo can be removed from the >>> parent's queue. >>> >>> If such a fix is not possible, it may be better to send a bare SIGCHLD >>> (si_code is SI_KERNEL or 0, depending on how many signals are pending) >>> in this situation and document that reapers must use WAIT_ANY or P_ALL. >>> (However, compared to the pre-r309886 situation they can still use >>> SIGCHLD to get notified when to call waitpid() or similar.) >>> >> >> > > -- Regards, Bryan Drewery
signature.asc
Description: OpenPGP digital signature
