We've encountered zombies that are waiting for a thread to exit that are looping in ep_poll() almost endlessly although there is a pending SIGKILL as a result of a group exit.
This happens because we always find ep_events_available() and fetch more events and never are able to check for signal_pending() that would break from the loop and return -EINTR. Special case fatal signals and break immediately to guarantee that we loop to fetch more events and delay making a timely exit. It would also be possible to simply move the check for signal_pending() higher than checking for ep_events_available(), but there have been no reports of delayed signal handling other than SIGKILL preventing zombies from exiting that would be fixed by this. Signed-off-by: David Rientjes <rient...@google.com> --- fs/eventpoll.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1748,6 +1748,16 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, * to TASK_INTERRUPTIBLE before doing the checks. */ set_current_state(TASK_INTERRUPTIBLE); + /* + * Always short-circuit for fatal signals to allow + * threads to make a timely exit without the chance of + * finding more events available and fetching + * repeatedly. + */ + if (fatal_signal_pending(current)) { + res = -EINTR; + break; + } if (ep_events_available(ep) || timed_out) break; if (signal_pending(current)) {