On Mon, Feb 28, 2011 at 6:20 PM, Chet Ramey <chet.ra...@case.edu> wrote: > > The patch looks good. I'll take a closer look and probably produce a > patch for bash-4.2 based on it. Thanks for taking a look.
So I think that Oleg Nesterov is correct in that the -1 return with errno==EINTR will never actually trigger, because it is re-tried by the loop in "waitchld()". So I don't think my patch is really doing what it _intends_ to do. So in wait_for() (after my patch): r = waitchld (pid, 1); ... if (r == -1 && errno == EINTR && wait_sigint_received) { child_blocked_sigint = 1; } that child_blocked_sigint may never actually trigger. Which certainly explains why I couldn't reproduce the lost ^C any more: it really fixes the race condition, but it does it by basically never considering a child to block ^C. So I think the attached patch is better - it moves all the child_blocked_sigint logic down into waitchld() itself, so that it can really see the right error values. HOWEVER! When I do this, I can see the "lost ^C" issue again. It seems to be a bit harder than before, but I have a very hard time really judging it, it's subjective. But I get traces like this: ... 22:13:02.434759 rt_sigaction(SIGINT, {0x43c980, [], SA_RESTORER, 0x301f833140}, {SIG_DFL, [ 22:13:02.434790 wait4(-1, 0x7fffaa90befc, 0, NULL) = ? ERESTARTSYS (To be restarted) 22:13:02.434945 --- SIGINT (Interrupt) @ 0 (0) --- 22:13:02.434957 rt_sigreturn(0x2) = -1 EINTR (Interrupted system call) 22:13:02.434980 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 4623 22:13:02.435005 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x301f833140}, {0x43c980, [ 22:13:02.435041 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 22:13:02.435063 --- SIGCHLD (Child exited) @ 0 (0) --- 22:13:02.435074 wait4(-1, 0x7fffaa90ba5c, WNOHANG, NULL) = -1 ECHILD (No child processes) 22:13:02.435095 rt_sigreturn(0xffffffffffffffff) = 0 ... ie we hit the exact case where the ^C happens just as the child is exiting, so the child has already done the "exit(0)" system call, but the exit() takes long enough that bash has time to react to the ^C and see that EINTR. Note the timing: the SIGINT happens at 02.434945, with the wait returning successfully at 02.434980. We're talking microseconds, but the whole "fork+exec+wait" things are all very fast. So it's just the luck of the draw just where the ^C happens. I can't get it to happen reliably, but it is not entirely rare either. It probably needs a lot of luck, and SMP with some timing bad luck to trigger. And it probably helps that the Linux fork/exec cost is quite low, so being able to hit it just at the exit() is easier. Anyway - to recap: looking at EINTR isn't sufficient, and doesn't close the race. It's _really_ hard to try to decide on the ambiguous case of "child exited without errors on its own just as ^C came in" and "child blocked ^C and exited afterwards without errors". Anybody have any other heuristics to try to disambiguate the two cases? If we really are talking about "interactive programs that catch ^C" here, then it is possible that the only real heuristic is one that is based on time. How _long_ did it take for the process to exit? If the process exits without WIFSIGNALED a long time after ^C (where "long" is obviously only in computer terms), then we might assume it really blocked it and considered it actual input. I don't much like the idea of time-based heuristics, but right now I think bash resolves the ambiguity the wrong way around: bash prefers to err on the "drop ^C" side, even though it's likely to be the rare case. A real interactive program that uses ^C (like an editor) isn't actually ever going to see a SIGINT _at_all_, since it will set the tty state to -isig, and actually read the ^C as the character '\003' rather than have any SIGINT issues). I dunno. Linus
jobs.c | 36 ++++++++++++++++++++++++++++-------- 1 files changed, 28 insertions(+), 8 deletions(-) diff --git a/jobs.c b/jobs.c index df13ad9..95ad995 100644 --- a/jobs.c +++ b/jobs.c @@ -2193,7 +2193,7 @@ restore_sigint_handler () } } -static int wait_sigint_received; +static int wait_sigint_received, child_blocked_sigint; /* Handle SIGINT while we are waiting for children in a script to exit. The `wait' builtin should be interruptible, but all others should be @@ -2365,6 +2365,7 @@ wait_for (pid) /* This is possibly a race condition -- should it go in stop_pipeline? */ wait_sigint_received = 0; + child_blocked_sigint = 0; if (job_control == 0 || (subshell_environment&SUBSHELL_COMSUB)) { old_sigint_handler = set_signal_handler (SIGINT, wait_sigint_handler); @@ -2461,6 +2462,9 @@ wait_for (pid) } while (PRUNNING (child) || (job != NO_JOB && RUNNING (job))); + /* Restore the original SIGINT signal handler. */ + restore_sigint_handler (); + /* The exit state of the command is either the termination state of the child, or the termination state of the job. If a job, the status of the last child in the pipeline is the significant one. If the command @@ -2556,10 +2560,9 @@ if (job == NO_JOB) pass the status back to our parent. */ s = job_signal_status (job); - if (WIFSIGNALED (s) && WTERMSIG (s) == SIGINT && signal_is_trapped (SIGINT) == 0) + if (!child_blocked_sigint && signal_is_trapped (SIGINT) == 0) { UNBLOCK_CHILD (oset); - restore_sigint_handler (); old_sigint_handler = set_signal_handler (SIGINT, SIG_DFL); if (old_sigint_handler == SIG_IGN) restore_sigint_handler (); @@ -2585,9 +2588,6 @@ wait_for_return: UNBLOCK_CHILD (oset); - /* Restore the original SIGINT signal handler before we return. */ - restore_sigint_handler (); - return (termination_state); } @@ -3087,9 +3087,29 @@ waitchld (wpid, block) /* If waitpid returns 0, there are running children. If it returns -1, the only other error POSIX says it can return is EINTR. */ CHECK_TERMSIG; + + /* If the waitchld returned EINTR, and the shell got a SIGINT, + then the child has not died yet, and we assume that the + child has blocked SIGINT. In that case, we require that the + child return with WSIGTERM() == SIGINT to actually consider + the ^C relevant. This is racy (the child may be in the + process of exiting and bash reacted to the EINTR first), + but this makes the race window much much smaller */ + if (pid < 0 && errno == EINTR && wait_sigint_received) + { + child_blocked_sigint = 1; + } + if (pid <= 0) continue; /* jumps right to the test */ + /* If the child explicitly exited with a SIGINT, we dismiss the + fact that we guessed it blocked sigint earlier */ + if (WIFSIGNALED(status) && WTERMSIG(status) == SIGINT) + { + child_blocked_sigint = 0; + } + /* children_exited is used to run traps on SIGCHLD. We don't want to run the trap if a process is just being continued. */ if (WIFCONTINUED(status) == 0) @@ -3277,7 +3297,7 @@ set_job_status_and_cleanup (job) does not exit due to SIGINT, run the trap handler but do not otherwise act as if we got the interrupt. */ if (wait_sigint_received && interactive_shell == 0 && - WIFSIGNALED (child->status) == 0 && IS_FOREGROUND (job) && + child_blocked_sigint && IS_FOREGROUND (job) && signal_is_trapped (SIGINT)) { int old_frozen; @@ -3299,7 +3319,7 @@ set_job_status_and_cleanup (job) signals are sent to process groups) or via kill(2) to the foreground process by another process (or itself). If the shell did receive the SIGINT, it needs to perform normal SIGINT processing. */ - else if (wait_sigint_received && (WTERMSIG (child->status) == SIGINT) && + else if (wait_sigint_received && !child_blocked_sigint && IS_FOREGROUND (job) && IS_JOBCONTROL (job) == 0) { int old_frozen;