[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

I'll take another stab at this. In the attachment (assign-tasks.patch), I've combined a lot of the ideas presented on this issue, so thank you both for your input. Anyway:

- The basic idea of the patch is to record the mapping of tasks to workers. I've added a protocol between the parent process and the workers that allows this to happen without adding a race condition between recording the task and the child dying.

- If a child unexpectedly dies, the worker_handler pretends that all of the jobs currently assigned to it raised a RuntimeError. (Multiple jobs can be assigned to a single worker if the result handler is being slow.)

- The guarantee I try to provide is that each job will be started at most once. There is enough information to instead ensure that each job is run exactly once, but in general whether that's acceptable or useful is really only known at the application level.

Some notes:

- I haven't implemented this approach for the ThreadPool yet.

- The test suite runs but occasionally hangs on shutting down the pool in Ask's tests in multiprocessing-tr...@82502-termination-trackjobs.patch. My experiments seem to indicate this is due to a worker dying while holding a queue lock. So I think a next step is to deal with workers dying while holding a queue lock, although this seems unlikely in practice. I have some ideas as to how you could fix this, if we decide it's worth trying.

Anyway, please let me know what you think of this approach/sample implementation. If we decide that this seems promising, I'd be happy to build it out further.

--
Added file: http://bugs.python.org/file18513/assign-tasks.patch

___ Python tracker <http://bugs.python.org/issue9205> ___
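[A minimal sketch of the bookkeeping idea described in the comment above. All names and the cache protocol are assumptions for illustration, not the actual contents of assign-tasks.patch: track which jobs each worker has picked up, and when a worker dies, fail exactly those jobs.]

"""
# Sketch only -- 'assigned' and the _set() protocol are assumptions.
assigned = {}  # worker pid -> set of (job, i) entries it has ACKed

def record_ack(pid, job, i):
    assigned.setdefault(pid, set()).add((job, i))

def record_result(pid, job, i):
    assigned.get(pid, set()).discard((job, i))

def handle_worker_death(pid, cache):
    # Pretend every job the dead worker held raised a RuntimeError,
    # so no ApplyResult waits forever on a lost task.
    for job, i in assigned.pop(pid, set()):
        if job in cache:
            cache[job]._set(i, (False, RuntimeError('worker died unexpectedly')))
"""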
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

Thanks for looking at it! Basically this patch requires the parent process to be able to send a message to a particular worker. As far as I can tell, the existing queues allow the children to send a message to the parent, or the parent to send a message to one child (whichever happens to win the race; not a particular one). I don't love introducing one queue per child either, although I don't have a sense of how much overhead that would add. Does the problem make sense/do you have any ideas for an alternate solution?

--
___ Python tracker <http://bugs.python.org/issue9205> ___
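[For concreteness, a rough sketch of the one-queue-per-child layout being weighed here; everything below is illustrative, not part of any posted patch.]

"""
import multiprocessing

def worker(inbox):
    # Each child reads only from its own queue.
    while True:
        msg = inbox.get()
        if msg is None:   # sentinel: shut down
            break
        # ... act on the parent's message ...

inboxes, procs = [], []
for _ in range(4):
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    inboxes.append(q)
    procs.append(p)

inboxes[2].put('hello')   # addressed to worker 2 only, no race
for q in inboxes:         # broadcast shutdown
    q.put(None)
for p in procs:
    p.join()
"""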
[issue8296] multiprocessing.Pool hangs when issuing KeyboardInterrupt
Changes by Greg Brockman:

--
nosy: +gdb

___ Python tracker <http://bugs.python.org/issue8296> ___
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

Hmm, a few notes. I have a bunch of nitpicks, but those can wait for a later iteration. (Just one style nit: I noticed a few unneeded whitespace changes... please try not to do that, as it makes the patch harder to read.)

- Am I correct that you handle a crashed worker by aborting all running jobs? If so:
  - Is this acceptable for your use case? I'm fine with it, but had been under the impression that we would rather this did not happen.
  - If you're going to the effort of ACKing, why not record the mapping of tasks to workers so you can be more selective in your termination? Otherwise, what does the ACKing do towards fixing this particular issue?

- I think in the final version you'd need to introduce some interthread locking, because otherwise you're going to have weird race conditions. I haven't thought too hard about whether you can get away with just catching unexpected exceptions, but it's probably better to do the locking.

- I'm getting hangs infrequently enough to make debugging annoying, and I don't have time to track down the bug right now. Why don't you strip out any changes that are not needed (e.g. AFAICT, the ACK logic), make sure there aren't weird race conditions, and if we start converging on a patch that looks right from a high level we can try to make it work on all the corner cases?

--
___ Python tracker <http://bugs.python.org/issue9205> ___
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

Ah, you're right--sorry, I had misread your code. I hadn't noticed the usage of the worker_pids. This explains what you're doing with the ACKs. Now, the problem is, I think doing it this way introduces some races (which is why I introduced the ACK from the task handler in my most recent patch). What happens if:

- A worker removes a job from the queue and is killed before sending an ACK.

- A worker removes a job from the queue, sends an ACK, and then is killed. Due to bad luck with the scheduler, the parent cleans up the worker before it has recorded the worker pid.

You're now reading from self._cache in one thread but writing to it in another. What happens if a worker sends a result and then is killed? Again, I haven't thought too hard about what will happen here, so if you have a correctness argument for why it's safe as-is I'd be happy to hear it.

Also, I just noted that your current way of dealing with child deaths doesn't play well with the maxtasksperchild variable. In particular, try running:

"""
import multiprocessing

def foo(x):
    return x

multiprocessing.Pool(1, maxtasksperchild=1).map(foo, [1, 2, 3, 4])
"""

(This should be an easy fix.)

--
___ Python tracker <http://bugs.python.org/issue9205> ___
[issue4106] multiprocessing occasionally spits out exception during shutdown
Greg Brockman added the comment:

For what it's worth, I think I have a simpler reproducer of this issue. Using freshly-compiled python-from-trunk (as well as multiprocessing-from-trunk), I get tracebacks from the following about 30% of the time:

"""
import multiprocessing, time

def foo(x):
    time.sleep(3)

multiprocessing.Pool(1).apply(foo, [1])
"""

My tracebacks are of the form:

"""
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 530, in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 483, in run
  File "/usr/local/lib/python2.7/multiprocessing/pool.py", line 272, in _handle_workers
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
"""

--
nosy: +gdb

___ Python tracker <http://bugs.python.org/issue4106> ___
[issue4106] multiprocessing occasionally spits out exception during shutdown
Greg Brockman added the comment:

I'm on Ubuntu 10.04, 64 bit.

--
___ Python tracker <http://bugs.python.org/issue4106> ___
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
New submission from Greg Brockman:

I have recently begun using multiprocessing for a variety of batch jobs. It's a great library, and it's been quite useful. However, I have been bitten several times by situations where a worker process in a Pool will unexpectedly die, leaving multiprocessing hanging in a wait. A simple example of this is produced by the following:

"""
#!/usr/bin/env python
import multiprocessing, sys

def foo(x):
    sys.exit(1)

multiprocessing.Pool(1).apply(foo, [1])
"""

The child will exit and the parent will hang forever. A similar occurrence happens if one pushes C-c while a child process is running (this special case is noted in http://bugs.python.org/issue8296) or if the child is killed by a signal.

Attached is a patch to handle unexpected terminations of children processes and prevent the parent process from hanging. A test case is included. (Developed and tested on 64-bit Ubuntu.) Please let me know what you think. Thanks!

--
components: Library (Lib)
files: termination.patch
keywords: patch
messages: 109585
nosy: gdb
priority: normal
severity: normal
status: open
title: Parent process hanging in multiprocessing if children terminate unexpectedly
type: behavior
versions: Python 2.6, Python 2.7
Added file: http://bugs.python.org/file17905/termination.patch

___ Python tracker <http://bugs.python.org/issue9205> ___
[issue9207] multiprocessing occasionally spits out exception during shutdown
New submission from Greg Brockman:

On Ubuntu 10.04, using freshly-compiled python-from-trunk (as well as multiprocessing-from-trunk), I get tracebacks from the following about 30% of the time:

"""
import multiprocessing, time

def foo(x):
    time.sleep(3)

multiprocessing.Pool(1).apply(foo, [1])
"""

My tracebacks are of the form:

"""
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 530, in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 483, in run
  File "/usr/local/lib/python2.7/multiprocessing/pool.py", line 272, in _handle_workers
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
"""

This information was originally posted to http://bugs.python.org/issue4106.

--
components: Library (Lib)
messages: 109588
nosy: gdb
priority: normal
severity: normal
status: open
title: multiprocessing occasionally spits out exception during shutdown
type: behavior
versions: Python 2.6, Python 2.7

___ Python tracker <http://bugs.python.org/issue9207> ___
[issue4106] multiprocessing occasionally spits out exception during shutdown
Greg Brockman added the comment:

Sure thing. See http://bugs.python.org/issue9207.

--
___ Python tracker <http://bugs.python.org/issue4106> ___
[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)
Greg Brockman added the comment:

That's likely a mistake on my part. I'm not observing this using the stock version of multiprocessing on my Ubuntu machine (after running O(100) times). I do, however, observe it when using either python2.7 or python2.6 with multiprocessing-from-trunk, if that's interesting. I'm not really sure what the convention is here; should this be filed just under Python 2.7? Thanks.

--
___ Python tracker <http://bugs.python.org/issue9207> ___
[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)
Greg Brockman added the comment:

No, I'm not using the Google code backport. To be clear, I've tried testing this with two versions of multiprocessing:

- multiprocessing-from-trunk (r82645): I get these exceptions with ~40% frequency
- multiprocessing from Ubuntu 10.04 (version 0.70a1): no such exceptions observed

Out of curiosity, I did just try this with the processing library (version 0.52) on a 64-bit Debian Lenny box, and did not observe these exceptions. Hope that's useful!

--
___ Python tracker <http://bugs.python.org/issue9207> ___
[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)
Greg Brockman added the comment:

> Wait - so, you are pulling svn trunk, compiling and running your test
> with the built python executable?

Yes. I initially observed this issue while using 10.04's Python (2.6.5), but wanted to make sure it wasn't fixed by using a newer interpreter.

> I'm not following the "multiprocessing-from-trunk" distinction unless
> you're picking the module out of the tree / compiling it and then
> moving it into some other install. I might be being overly dense.

Initially I was doing exactly that. (Some context: I was working on a patch to fix a different multiprocessing issue, and figured I may as well write my patch against the most recent version of the library.) Note that I was using Lucid's _multiprocessing, so there was no compilation involved.

> You're running your test with cd src/tree/ && ./python - right?

What... is src/tree? If it's what you're asking, I am running the freshly-compiled python interpreter, and it does seem to be using the relevant modules out of trunk:

>>> import threading; threading.__file__
'/usr/local/lib/python2.7/threading.pyc'
>>> import multiprocessing; multiprocessing.__file__
'/usr/local/lib/python2.7/multiprocessing/__init__.pyc'
>>> import _multiprocessing; _multiprocessing.__file__
'/usr/local/lib/python2.7/lib-dynload/_multiprocessing.so'

When running with 2.6, all modules are whatever's available for 10.04 except for the multiprocessing that I took from trunk:

>>> import threading; threading.__file__
'/usr/lib/python2.6/threading.pyc'
>>> import multiprocessing; multiprocessing.__file__
'multiprocessing/__init__.pyc'
>>> import _multiprocessing; _multiprocessing.__file__
'/usr/lib/python2.6/lib-dynload/_multiprocessing.so'

> Also, what, if any, compile flags are you passing to the python build?

I just ran ./configure && make && make install.

Sorry about the confusion--let me know if you'd like additional information. I can test on other platforms/with other configurations if it would be useful.

--
___ Python tracker <http://bugs.python.org/issue9207> ___
[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)
Greg Brockman added the comment:

Yeah, I've just taken a checkout from trunk, ran './configure && make && make install', and reproduced on:

- Ubuntu 10.04 32-bit
- Ubuntu 9.04 32-bit

--
___ Python tracker <http://bugs.python.org/issue9207> ___
[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)
Greg Brockman added the comment:

With the line commented out, I no longer see any exceptions. Although, if I understand what's going on, there is still a (much rarer) possibility of an exception, right? I guess in the common case, the worker_handler is in the sleep when shutdown begins. But if it happens to be in the _maintain_pool step, would you still get these exceptions?

--
___ Python tracker <http://bugs.python.org/issue9207> ___
[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)
Greg Brockman added the comment:

Think http://www.mail-archive.com/python-l...@python.org/msg282114.html is relevant?

--
___ Python tracker <http://bugs.python.org/issue9207> ___
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

Cool, thanks. I'll note that with this patch applied, using the test program from 9207 I consistently get the following exception:

"""
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
  File "/usr/lib/python2.6/threading.py", line 484, in run
  File "/home/gdb/repositories/multiprocessing/pool.py", line 312, in _handle_workers
  File "/home/gdb/repositories/multiprocessing/pool.py", line 190, in _maintain_pool
  File "/home/gdb/repositories/multiprocessing/pool.py", line 158, in _join_exited_workers
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
"""

This is line 148 in the unpatched source, namely the 'reversed(range(len(self._pool)))' line of _join_exited_workers. Looks like the same issue, where instead reversed/range/len have been set to None.

So I think by changing how much time the worker_handler spends in various functions, I've made it possible (or just more likely?) that if we lose the race with interpreter shutdown the worker_handler will be in the middle of _join_exited_workers. This may mean that someone should keep around a local reference to reversed/range/len... not sure if there's a better solution.

--
___ Python tracker <http://bugs.python.org/issue9205> ___
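[One way to keep the local references suggested above, sketched against the shape of the 2.7 pool code. This is a possible fix for illustration, not the change that was committed: default arguments are evaluated at def time, so the handler thread keeps usable references even after interpreter shutdown sets module globals to None.]

"""
def _join_exited_workers(self, _reversed=reversed, _range=range, _len=len):
    # _reversed/_range/_len stay bound locally even when the builtins
    # visible through the module globals have been cleared at shutdown.
    cleaned = False
    for i in _reversed(_range(_len(self._pool))):
        worker = self._pool[i]
        if worker.exitcode is not None:
            worker.join()
            cleaned = True
            del self._pool[i]
    return cleaned
"""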
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

What about just catching the exception? See e.g. the attached patch. (Disclaimer: not heavily tested.)

--
Added file: http://bugs.python.org/file17934/shutdown.patch

___ Python tracker <http://bugs.python.org/issue9205> ___
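[The shape of the idea in shutdown.patch, as a sketch. The helper name _make_shutdown_safe comes up later in the thread; its exact body here is an assumption, and the breadth of the except clause is debated below.]

"""
def _make_shutdown_safe(func, *args, **kwds):
    # Run one handler-thread step, tolerating interpreter teardown.
    try:
        return func(*args, **kwds)
    except Exception:
        # During interpreter shutdown, module globals (including builtins
        # used inside func) may already be None, which surfaces as
        # "'NoneType' object is not callable". Give up quietly.
        pass
"""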
[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)
Greg Brockman added the comment:

With pool.py:272 commented out, running about 50k iterations, I saw 4 tracebacks giving an exception on pool.py:152. So this seems to imply the race does exist (i.e. that the thread is in _maintain_pool rather than time.sleep when shutdown begins). It looks like the _maintain_pool run takes O(10^-4)s, so it's not surprising the error is so rare. That being said, the patch I submitted in issue 9205 should handle this case as well.

--
___ Python tracker <http://bugs.python.org/issue9207> ___
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

Thanks much for taking a look at this!

> why are you terminating the second pass after finding a failed
> process?

Unfortunately, if you've lost a worker, you are no longer guaranteed that cache will eventually be empty. In particular, you may have lost a task, which could result in an ApplyResult waiting forever for a _set call.

More generally, my chief assumption that went into this is that the unexpected death of a worker process is unrecoverable. It would be nice to have a better workaround than just aborting everything, but I couldn't see a way to do that.

> Unpickleable errors and other errors occurring in the worker body are
> not exceptional cases, at least not now that the pool is supervised
> by _handle_workers.

I could be wrong, but that's not what my experiments were indicating. In particular, if an unpickleable error occurs, then a task has been lost, which means that the relevant map, apply, etc. will wait forever for completion of the lost task.

> I think the result should be set also in this case, so the user can
> inspect the exception after the fact.

That does sound useful. Although, how can you determine the job (and the value of i) if it's an unpickleable error? It would be nice to be able to retrieve job/i without having to unpickle the rest.

> For shutdown.patch, I thought this only happened in the worker
> handler, but you've enabled this for the result handler too? I don't
> care about the worker handler, but with the result handler I'm
> worried that I don't know what ignoring these exceptions actually
> means.

You have a good point. I didn't think about the patch very hard. I've only seen these exceptions from the worker handler, but AFAICT there's no guarantee that bad luck with the scheduler wouldn't result in the same problem in the result handler. One option would be to narrow the breadth of the exceptions caught by _make_shutdown_safe (do we need to catch anything but TypeErrors?). Another option would be to enable this only for the worker handler. I don't have a particularly great sense of what the Right Thing to do here is.

--
___ Python tracker <http://bugs.python.org/issue9205> ___
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

> For processes disappearing (if that can at all happen), we could solve
> that by storing the jobs a process has accepted (started working on),
> so if a worker process is lost, we can mark them as failed too.

Sure, this would be reasonable behavior. I had considered it but decided it was a larger change than I wanted to make without consulting the devs.

> I was already working on this issue last week actually, and I managed
> to do that in a way that works well enough (at least for me):

If I'm reading this right, you catch the exception upon pickling the result (at which point you have the job/i information already; totally reasonable). I'm worried about the case of unpickling the task failing. (Namely, the "task = get()" line of the "worker" method.) Try running the following:

"""
#!/usr/bin/env python
import multiprocessing

p = multiprocessing.Pool(1)

def foo(x):
    pass

p.apply(foo, [1])
"""

And if "task = get()" fails, then the worker doesn't know what the relevant job/i values are.

Anyway, so I guess the question that is forming in my mind is, what sorts of errors do we want to handle, and how do we want to handle them? My answer is I'd like to handle all possible errors with some behavior that is not "hang forever". This includes handling children processes dying by signals or os._exit, raising unpickling errors, etc. I believe my patch provides this functionality. By adding the extra mechanism that you've written/proposed, we can improve the error handling in specific recoverable cases (which probably constitute the vast majority of real-world cases).

--
___ Python tracker <http://bugs.python.org/issue9205> ___
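[To make the failure mode concrete, here is a simplified sketch of the worker loop; it is patterned on pool.worker but not copied from it, and the except branch is the hypothetical part. If the get() itself raises, there is no (job, i) to report against, so without extra machinery the parent's result waits forever.]

"""
def worker(get, put):
    while True:
        try:
            task = get()          # unpickling the task happens here
        except Exception:
            # Task lost in transit: we never obtained (job, i), so we
            # cannot send back a failure result for it.
            continue
        if task is None:          # sentinel: shut down
            break
        job, i, func, args, kwds = task
        try:
            result = (True, func(*args, **kwds))
        except Exception as e:
            result = (False, e)
        put((job, i, result))
"""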
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

> What kind of errors are you having that makes the get() call fail?

Try running the script I posted. It will fail with an AttributeError (raised during unpickling) and hang.

I'll note that the particular issues that I've run into in practice are:

- OOM kill destroying my workers but leaving the parent silently waiting
- KeyboardInterrupting the workers, and then having the parent hang

This AttributeError problem is one that I discovered while generating test cases for the patch.

--
___ Python tracker <http://bugs.python.org/issue9205> ___
[issue9244] multiprocessing.pool: Worker crashes if result can't be encoded
Greg Brockman added the comment:

This looks pretty reasonable to my untrained eye. I successfully applied and ran the test suite. To be clear, the errback change and the unpickleable result change are actually orthogonal, right? Anyway, I'm not really familiar with the protocol here, but assuming that you're open to code review:

> -    def apply_async(self, func, args=(), kwds={}, callback=None):
> +    def apply_async(self, func, args=(), kwds={}, callback=None,
> +            error_callback=None):
>         '''
>         Asynchronous equivalent of `apply()` builtin
>         '''
>         assert self._state == RUN
> -        result = ApplyResult(self._cache, callback)
> +        result = ApplyResult(self._cache, callback, error_callback)
>         self._taskqueue.put(([(result._job, None, func, args, kwds)], None))
>         return result

Sure. Why not add an error_callback for map_async as well?

> -    def __init__(self, cache, callback):
> +    def __init__(self, cache, callback, error_callback=None):
>         self._cond = threading.Condition(threading.Lock())
>         self._job = job_counter.next()
>         self._cache = cache
>         self._ready = False
>         self._callback = callback
> +        self._errback = error_callback
>         cache[self._job] = self

Any reason you chose to use a different internal name (errback versus error_callback)? It seems cleaner to me to be consistent about the name.

>  def sqr(x, wait=0.0):
>      time.sleep(wait)
>      return x*x
> +
>  class _TestPool(BaseTestCase):
>
>      def test_apply(self):
> @@ -1020,6 +1021,7 @@ class _TestPool(BaseTestCase):
>          self.assertEqual(get(), 49)
>          self.assertTimingAlmostEqual(get.elapsed, TIMEOUT1)
>
> +
>      def test_async_timeout(self):

In general, I'm wary of nonessential whitespace changes... did you mean to include these?

> +        scratchpad = [None]
> +        def errback(exc):
> +            scratchpad[0] = exc
> +
> +        res = p.apply_async(raising, error_callback=errback)
> +        self.assertRaises(KeyError, res.get)
> +        self.assertTrue(scratchpad[0])
> +        self.assertIsInstance(scratchpad[0], KeyError)
> +
> +        p.close()

Using "assertTrue" seems misleading. "assertIsNotNone" is what you really mean, right? Although, I believe that's redundant, since presumably self.assertIsInstance(None, KeyError) will error out anyway (I haven't verified this).

> +    def test_unpickleable_result(self):
> +        from multiprocessing.pool import MaybeEncodingError
> +        p = multiprocessing.Pool(2)
> +
> +        # Make sure we don't lose pool processes because of encoding errors.
> +        for iteration in xrange(20):
> +
> +            scratchpad = [None]
> +            def errback(exc):
> +                scratchpad[0] = exc
> +
> +            res = p.apply_async(unpickleable_result, error_callback=errback)
> +            self.assertRaises(MaybeEncodingError, res.get)
> +            wrapped = scratchpad[0]
> +            self.assertTrue(wrapped)

Again, assertTrue is probably not what you want, and is probably redundant.

> +            self.assertIsInstance(scratchpad[0], MaybeEncodingError)

Why use scratchpad[0] rather than wrapped?

> +            self.assertIsNotNone(wrapped.exc)
> +            self.assertIsNotNone(wrapped.value)

Under what circumstances would these be None? (Perhaps you want wrapped.exc != 'None'?) The initializer for MaybeEncodingError enforces the invariant that exc/value are strings, right?

> +
>  class _TestPoolWorkerLifetime(BaseTestCase):
>
>      ALLOWED_TYPES = ('processes', )

Three line breaks there seems excessive.

--
___ Python tracker <http://bugs.python.org/issue9244> ___
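[For reference, a usage sketch of the API under review. The behavior is per the patch being discussed, not the stock library of the time, and might_raise/arg are hypothetical names.]

"""
import multiprocessing

def on_error(exc):
    # Invoked in the parent instead of the success callback when the
    # job raises (or, per the patch, when its result can't be encoded).
    print 'job failed:', exc

pool = multiprocessing.Pool(2)
result = pool.apply_async(might_raise, [arg], error_callback=on_error)
"""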
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

While looking at your patch in issue 9244, I realized that my code fails to handle an unpickleable task, as in:

"""
#!/usr/bin/env python
import multiprocessing

foo = lambda x: x
p = multiprocessing.Pool(1)
p.apply(foo, [1])
"""

This should be fixed by the attached pickling_error.patch (independent of my other patches).

--
Added file: http://bugs.python.org/file17987/pickling_error.patch

___ Python tracker <http://bugs.python.org/issue9205> ___
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

Before I forget, looks like we also need to deal with the result from a worker being un-unpickleable:

"""
#!/usr/bin/env python
import multiprocessing

def foo(x):
    global bar
    def bar(x):
        pass
    return bar

p = multiprocessing.Pool(1)
p.apply(foo, [1])
"""

This shouldn't require much more work, but I'll hold off on submitting a patch until we have a better idea of where we're going in this arena.

> Instead of restarting crashed worker processes it will simply bring down
> the pool, right?

Yep. Again, as things stand, once you've lost a worker, you've lost a task, and you can't really do much about it. I guess that depends on your application though... is your use-case such that you can lose a task without it mattering? If tasks are idempotent, one could have the task handler resubmit them, etc. But really, thinking about the failure modes I've seen (OOM kills/user-initiated interrupt) I'm not sure under what circumstances I'd like the pool to try to recover.

The idea of recording the mapping of tasks -> workers seems interesting. Getting all of the corner cases could be hard (e.g. making removing a task from the queue and recording which worker did the removing atomic, detecting if the worker crashed while still holding the queue lock) and doing this would require extra mechanism. This feature does seem to be useful for pools running many different jobs, because that way a crashed worker need only terminate one job.

Anyway, I'd be curious to know more about the kinds of crashes you've encountered from which you'd like to be able to recover. Is it just Unpickleable exceptions, or are there others?

--
___ Python tracker <http://bugs.python.org/issue9205> ___
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

>> Before I forget, looks like we also need to deal with the
>> result from a worker being un-unpickleable:
> This is what my patch in bug 9244 does...

Really? I could be misremembering, but I believe you deal with the case of the result being unpickleable. I.e. you deal with the put(result) failing, but not the get() in the result handler. Does my sample program work with your patch applied?

> while state != TERMINATE:
>     result = get(timeout=1)
>     if all_processes_dead():
>         break

Will this sort of approach work with the supervisor, which continually respawns workers?

> user-initiated interrupts, this is very important to recover from,
> think of some badly written library code suddenly raising SystemExit,
> this shouldn't terminate other jobs, and it's probably easy to
> recover from, so why shouldn't it try?

To be clear, in this case I was thinking of KeyboardInterrupts.

I'll take a look at your patch in a bit. From our differing use-cases, I do think it could make sense as a configuration option, but where it probably belongs is on the wait() call of ApplyResult.

--
___ Python tracker <http://bugs.python.org/issue9205> ___
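[A user-level stopgap in the spirit of the suggestion above, as a sketch: bound the wait so a lost task surfaces as a TimeoutError instead of a hang. ApplyResult.get does accept a timeout; func, arg, and the 60-second value are illustrative.]

"""
res = pool.apply_async(func, [arg])
try:
    value = res.get(timeout=60)    # bound the wait instead of blocking forever
except multiprocessing.TimeoutError:
    pool.terminate()               # assume a lost worker/task and give up
    raise
"""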
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

Actually, the program you demonstrate is nonequivalent to the one I posted. The one I posted pickles just fine because 'bar' is a global name, but doesn't unpickle because it doesn't exist in the parent's namespace. (See http://docs.python.org/library/pickle.html#what-can-be-pickled-and-unpickled.) Although, if you're able to run my test program verbatim, then it's entirely possible I'm just missing something.

Anyway, I do think that adding a 'worker_missing_callback' could work. You'd still have to make sure the ApplyResult (or MapResult) can crash the pool if it deems necessary though.

--
___ Python tracker <http://bugs.python.org/issue9205> ___
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

Started looking at your patch. It seems to behave reasonably, although it still doesn't catch all of the failure cases. In particular, as you note, crashed jobs won't be noticed until the pool shuts down... but if you make a blocking call such as in the following program, you'll get a hang:

"""
#!/usr/bin/env python
import multiprocessing, os, signal

def foo(x):
    os.kill(os.getpid(), signal.SIGKILL)

multiprocessing.Pool(1).apply(foo, [1])
"""

The tests also occasionally hang in e.g. test_job_killed_by_signal (__main__.WithProcessesTestPoolSupervisor) ...

--
___ Python tracker <http://bugs.python.org/issue9205> ___
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

At first glance, looks like there are a number of sites where you don't change the blocking calls to non-blocking calls (e.g. get()). Almost all of the get()s have the potential to be called when there is no possibility for them to terminate. I might recommend referring to my original termination.patch... I believe I tracked down the majority of such blocking calls.

In the interest of simplicity though, I'm beginning to think that the right answer might be to just do something like termination.patch but to conditionalize crashing the pool on a pool configuration option. That way the behavior would be no worse for your use case. Does that sound reasonable?

--
___ Python tracker <http://bugs.python.org/issue9205> ___
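[A sketch of the blocking-to-non-blocking conversion being discussed; the helper name and the liveness check are assumptions. The point is that a get() with a timeout gives the caller a chance to notice the peer has died, where a bare get() can block forever.]

"""
import Queue  # the module is named 'queue' on Python 3

def get_or_give_up(q, still_possible, poll=1.0):
    # Poll instead of blocking indefinitely; multiprocessing queues
    # raise Queue.Empty when a timed get() expires.
    while True:
        try:
            return q.get(timeout=poll)
        except Queue.Empty:
            if not still_possible():
                raise RuntimeError('worker died; result will never arrive')
"""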
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

> I thought the EOF errors would take care of that, at least this has
> been running in production on many platforms without that happening.

There are a lot of corner cases here, some more pedantic than others. For example, suppose a child dies while holding the queue read lock... that wouldn't trigger an EOF error anywhere. Would a child being OOM-killed raise an EOF error? (It very well could, but I seem to recall that it does not.)

I've said most of this before, but I still believe it's relevant, so here goes. In the context where I'm using this library, I'll often run jobs that should complete in O(10 minutes). I'll often start a job, realize I did something wrong and hit C-c (which could catch the workers anywhere). I've seen workers be OOM killed, silently dropping the tasks they had. As we've established, at the moment any of these failures results in a hang; I'd be very happy to see any sort of patch that improves my chances of seeing the program terminate in a finite amount of time. (And I'd be happiest if this is guaranteed.) It's possible that my use case isn't supported... but I just want to make sure I've made clear how I'm using the library. Does that make sense?

> How would you shut down the pool then?

A potential implementation is in termination.patch. Basically, try to shut down gracefully, but if you time out, just give up and kill everything.

> And why is that simpler?

It's a lot less code (one could write an even shorter patch that doesn't try to do any additional graceful error handling), doesn't add a new monitor thread, doesn't add any more IPC mechanism, etc. FWIW, I don't see any of these changes as bad, but I don't feel like I have a sense of how generally useful they would be.

--
___ Python tracker <http://bugs.python.org/issue9205> ___
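[The graceful-then-forceful shutdown described above, sketched at the application level rather than as the actual contents of termination.patch; the grace period is arbitrary.]

"""
import threading

def shutdown(pool, grace=10.0):
    pool.close()                        # stop accepting new tasks
    waiter = threading.Thread(target=pool.join)
    waiter.daemon = True
    waiter.start()
    waiter.join(grace)                  # wait for a graceful exit
    if waiter.is_alive():
        pool.terminate()                # grace period expired: kill everything
"""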
[issue9334] argparse does not accept options taking arguments beginning with dash (regression from optparse)
Changes by Greg Brockman:

--
nosy: +gdb

___ Python tracker <http://bugs.python.org/issue9334> ___
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

> You can't have a sensible default timeout, because the worker may be
> processing something important...

In my case, the jobs are either functional or idempotent anyway, so aborting halfway through isn't a problem. In general though, I'm not sure what kinds of use cases would tolerate silently-dropped jobs. And for example, if an OOM kill has just occurred, then you're already in a state where a job was unexpectedly terminated... you wouldn't be violating any more contracts by aborting.

In general, I can't help but feel that the approach of "ignore errors and keep going" leads to rather unexpected bugs (and in this case, it leads to infinite hangs). But even in languages where errors are ignored by default (e.g. sh), there are mechanisms for turning on abort-on-error handlers (e.g. set -e). So my response is yes, you're right that there's no great default here. However, I think it'd be worth (at least) letting the user specify "if something goes wrong, then abort". Keep in mind that this will only happen in very exceptional circumstances anyway.

> Not everything can be simple.

Sure, but given the choice between a simple solution and a complex one, all else being equal the simple one is desirable. And in this case, the more complicated mechanism seems to introduce subtle race conditions and failure modes.

Anyway, Jesse, it's been a while since we've heard anything from you... do you have thoughts on these issues? It would probably be useful to get a fresh opinion :).

--
___ Python tracker <http://bugs.python.org/issue9205> ___
[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly
Greg Brockman added the comment:

Thanks for the comment. It's good to know what constraints we have to deal with.

> we can not, however, change the API.

Does this include adding optional arguments?

--
___ Python tracker <http://bugs.python.org/issue9205> ___
[issue9535] Pending signals are inherited by child processes
New submission from Greg Brockman:

Upon os.fork(), pending signals are inherited by the child process. This can be demonstrated by pressing C-c in the middle of the following program:

"""
import os, sys, time, threading

def do_fork():
    while True:
        if not os.fork():
            print 'hello from child'
            sys.exit(0)
        time.sleep(0.5)

t = threading.Thread(target=do_fork)
t.start()
t.join()
"""

Right after os.fork(), each child will raise a KeyboardInterrupt exception. This behavior is different from the semantics of POSIX fork(), where child processes do not inherit their parents' pending signals.

Attached is a first stab at a patch to fix this issue. Please let me know what you think!

--
components: Extension Modules
files: signals.patch
keywords: patch
messages: 113104
nosy: gdb
priority: normal
severity: normal
status: open
title: Pending signals are inherited by child processes
type: behavior
versions: Python 2.5, Python 2.6, Python 2.7
Added file: http://bugs.python.org/file18416/signals.patch

___ Python tracker <http://bugs.python.org/issue9535> ___