New submission from Adin Scannell <a...@scannell.ca>:

While running a complex Python process that executes many subprocesses 
(using the subprocess module, specifically calling communicate()), I found 
zombie processes occasionally piling up. It turns out Python is not 
correctly wait()ing for the children. Although it happens for fewer than 5% 
of subprocesses in my case, it affects random Popen objects used in different 
ways (Popen() followed by read()/write()/wait() directly, or communicate()). 
I'd love to find out I'm crazy, but I'm not doing anything too sneaky, and the 
attached patch fixes the problem.

I'm not sure why it's happening in my particular environment (maybe it just so 
happens that the child processes enter into states with particular timing, or 
the parent receives signals at the wrong moments) but it's very reproducible 
for me.

I believe that the cause of the zombie processes is as follows:

If you read the description of the waitpid system call 
(http://www.kernel.org/doc/man-pages/online/pages/man2/wait.2.html), there are 
several events that can cause waitpid() to return. I have no idea why, but 
even without WNOHANG set, it looks like I'm occasionally getting back a 0 
return value from waitpid(). Interrupted system call? Stopped child process? 
I'm not sure at the moment. The documentation is a bit ambiguous as to whether 
this can happen, but the example code at the bottom of that page does seem to 
handle this spurious wakeup case, which subprocess does not. The net result is 
that the child process has *not* exited or been killed, yet the Python code 
paths don't consider this possibility (I believe it rarely happens in normal 
circumstances!).
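
To make the idea concrete, here is a minimal sketch of a wait loop that only 
treats the child as finished once waitpid() reports a real exit or a fatal 
signal. This is not the attached patch, just an illustration of the approach; 
the helper name wait_for_exit() is made up for this example, and the sketch 
assumes a POSIX platform.

```python
import errno
import os
import subprocess
import sys


def wait_for_exit(pid):
    """Hypothetical helper: block until `pid` has really terminated.

    Returns the exit code, or the negated signal number if the child was
    killed by a signal (the same convention Popen.returncode uses).
    """
    while True:
        try:
            wpid, status = os.waitpid(pid, 0)
        except OSError as e:
            if e.errno == errno.EINTR:
                # Interrupted by a signal before the child changed state.
                continue
            raise
        if wpid != pid:
            # waitpid() reported no state change for our child (for example
            # the 0 return described above), so it has not been reaped yet.
            continue
        if os.WIFEXITED(status):
            return os.WEXITSTATUS(status)
        if os.WIFSIGNALED(status):
            return -os.WTERMSIG(status)
        # Anything else (e.g. a stop notification) means the child is still
        # alive, so keep waiting instead of recording a bogus return code.


if __name__ == "__main__":
    # Spawn a short-lived child and reap it with the loop above.
    child = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(wait_for_exit(child.pid))
```

The point of the loop is that every path that does not correspond to a real 
termination goes back to waitpid() rather than falling through as if the 
child were gone.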

I discovered this bug on 2.7.2. I've prepared a patch for the 2.7 branch 
(75701:d46c1973d3c4), although I'm certain almost all versions, including the 
tip, suffer from this problem. I'm happy to port it to other branches if 
necessary, although I think the appropriate maintainers could whip it up in no 
time flat. I've tested my 2.7 fix and it solves my problem: no more zombies. 
The patch does not change the behaviour of the Popen class in the normal case, 
but allows it to handle spurious wakeups.

----------
components: Library (Lib)
files: waitpid-2.7.patch
keywords: patch
messages: 156683
nosy: amscanne
priority: normal
severity: normal
status: open
title: Popen wait() doesn't handle spurious wakeups
type: behavior
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 3.3, Python 3.4
Added file: http://bugs.python.org/file25009/waitpid-2.7.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue14396>
_______________________________________