[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-08-13 Thread Greg Brockman

Greg Brockman  added the comment:

I'll take another stab at this.  In the attachment (assign-tasks.patch), I've 
combined a lot of the ideas presented on this issue, so thank you both for your 
input.  Anyway:

- The basic idea of the patch is to record the mapping of tasks to workers.  
I've added a protocol between the parent process and the workers that allows 
this to happen without adding a race condition between recording the task and 
the child dying.
- If a child unexpectedly dies, the worker_handler pretends that all of the 
jobs currently assigned to it raised a RuntimeError.  (Multiple jobs can be 
assigned to a single worker if the result handler is being slow.)  A sketch of 
this bookkeeping appears after this list.
- The guarantee I try to provide is that each job will be started at most once. 
 There is enough information to instead ensure that each job is run exactly 
once, but in general whether that's acceptable or useful is really only known 
at the application level.
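
To make the bookkeeping in the first two bullets concrete, here is a minimal
sketch of the idea (illustrative only, not the attached assign-tasks.patch;
the class and method names are invented, and cache/_set refer to the existing
ApplyResult plumbing in multiprocessing.pool):
"""
import collections

class AssignmentTracker(object):
    # Hypothetical helper, not code from assign-tasks.patch.
    def __init__(self):
        # pid -> set of (job, i) pairs currently assigned to that worker
        self._assigned = collections.defaultdict(set)

    def record_ack(self, pid, job, i):
        # Worker `pid` acknowledged pulling (job, i) off the task queue.
        self._assigned[pid].add((job, i))

    def record_result(self, pid, job, i):
        # Worker `pid` delivered a result for (job, i); forget the assignment.
        self._assigned[pid].discard((job, i))

    def fail_worker(self, pid, cache):
        # The worker handler reaped `pid` unexpectedly: pretend every job
        # still charged to it raised a RuntimeError, so waiters wake up.
        for job, i in self._assigned.pop(pid, ()):
            result = cache.get(job)
            if result is not None:
                result._set(i, (False, RuntimeError('worker %d died' % pid)))
"""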

Some notes:
- I haven't implemented this approach for the ThreadPool yet.
- The test suite runs but occasionally hangs on shutting down the pool in Ask's 
tests in multiprocessing-tr...@82502-termination-trackjobs.patch.  My 
experiments seem to indicate this is due to a worker dying while holding a 
queue lock.  So I think a next step is to deal with workers dying while holding 
a queue lock, although this seems unlikely in practice.  I have some ideas as 
to how you could fix this, if we decide it's worth trying.

Anyway, please let me know what you think of this approach/sample 
implementation.  If we decide that this seems promising, I'd be happy to build 
it out further.

--
Added file: http://bugs.python.org/file18513/assign-tasks.patch




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-08-20 Thread Greg Brockman

Greg Brockman  added the comment:

Thanks for looking at it!  Basically this patch requires the parent process to 
be able to send a message to a particular worker.  As far as I can tell, the 
existing queues allow the children to send a message to the parent, or the 
parent to send a message to one child (whichever happens to win the race; not a 
particular one).

I don't love introducing one queue per child either, although I don't have a 
sense of how much overhead that would add.

Does the problem make sense/do you have any ideas for an alternate solution?
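
For the record, the "one queue per child" idea would look roughly like this (a
sketch only, using plain multiprocessing.Queue objects rather than the Pool's
internal queues):
"""
import multiprocessing

def worker(inq, outq):
    # Each worker reads only from its own inqueue; None is a shutdown sentinel.
    for func, args in iter(inq.get, None):
        outq.put(func(*args))

if __name__ == '__main__':
    outq = multiprocessing.Queue()
    inqueues, procs = [], []
    for _ in range(2):
        inq = multiprocessing.Queue()
        p = multiprocessing.Process(target=worker, args=(inq, outq))
        p.start()
        inqueues.append(inq)
        procs.append(p)
    inqueues[0].put((pow, (2, 10)))   # the parent picks exactly which worker runs this
    print(outq.get())                 # 1024
    for inq in inqueues:
        inq.put(None)
    for p in procs:
        p.join()
"""
Each extra Queue costs roughly one pipe plus its locks (and a feeder thread
once it is used), which is the overhead being weighed above.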

--




[issue8296] multiprocessing.Pool hangs when issuing KeyboardInterrupt

2010-08-26 Thread Greg Brockman

Changes by Greg Brockman :


--
nosy: +gdb




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-08-27 Thread Greg Brockman

Greg Brockman  added the comment:

Hmm, a few notes.  I have a bunch of nitpicks, but those can wait for a later 
iteration.  (Just one style nit: I noticed a few unneeded whitespace changes... 
please try not to do that, as it makes the patch harder to read.)

- Am I correct that you handle a crashed worker by aborting all running jobs?  
If so:
  - Is this acceptable for your use case?  I'm fine with it, but had been under 
the impression that we would rather this did not happen.
  - If you're going to the effort of ACKing, why not record the mapping of 
tasks to workers so you can be more selective in your termination?  Otherwise, 
what does the ACKing do towards fixing this particular issue?
- I think in the final version you'd need to introduce some interthread 
locking, because otherwise you're going to have weird race conditions.  I 
haven't thought too hard about whether you can get away with just catching 
unexpected exceptions, but it's probably better to do the locking.
- I'm getting hangs infrequently enough to make debugging annoying, and I don't 
have time to track down the bug right now.  Why don't you strip out any changes 
that are not needed (e.g. AFAICT, the ACK logic), make sure there aren't weird 
race conditions, and if we start converging on a patch that looks right from a 
high level we can try to make it work on all the corner cases?

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-08-27 Thread Greg Brockman

Greg Brockman  added the comment:

Ah, you're right--sorry, I had misread your code.  I hadn't noticed
the usage of the worker_pids.  This explains what you're doing with
the ACKs.  Now, the problem is, I think doing it this way introduces
some races (which is why I introduced the ACK from the task handler in
my most recent patch).  What happens if:
- A worker removes a job from the queue and is killed before sending an ACK.
- A worker removes a job from the queue, sends an ACK, and then is
killed.  Due to bad luck with the scheduler, the parent cleans up the
worker before it has recorded the worker's pid.

You're now reading from self._cache in one thread but writing to it in
another.  What happens if a worker sends a result and then is killed?
Again, I haven't thought too hard about what will happen here, so if
you have a correctness argument for why it's safe as-is I'd be happy
to hear it.

Also, I just noted that your current way of dealing with child deaths
doesn't play well with the maxtasksperchild variable.  In particular,
try running:
"""
import multiprocessing
def foo(x):
  return x
multiprocessing.Pool(1, maxtasksperchild=1).map(foo, [1, 2, 3, 4])
"""
(This should be an easy fix.)

--




[issue4106] multiprocessing occasionally spits out exception during shutdown

2010-07-08 Thread Greg Brockman

Greg Brockman  added the comment:

For what it's worth, I think I have a simpler reproducer of this issue.  Using 
freshly-compiled python-from-trunk (as well as multiprocessing-from-trunk), I 
get tracebacks from the following about 30% of the time:

"""
import multiprocessing, time
def foo(x):
    time.sleep(3)
multiprocessing.Pool(1).apply(foo, [1])
"""

My tracebacks are of the form:
"""
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 530, in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 483, in run
  File "/usr/local/lib/python2.7/multiprocessing/pool.py", line 272, in 
_handle_workers
: 'NoneType' object is not callable
"""

--
nosy: +gdb




[issue4106] multiprocessing occasionally spits out exception during shutdown

2010-07-08 Thread Greg Brockman

Greg Brockman  added the comment:

I'm on Ubuntu 10.04, 64 bit.

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-08 Thread Greg Brockman

New submission from Greg Brockman :

I have recently begun using multiprocessing for a variety of batch
jobs.  It's a great library, and it's been quite useful.  However, I have been 
bitten several times by situations where a worker process in a Pool will 
unexpectedly die, leaving multiprocessing hanging in a wait.  A simple example 
of this is produced by the following:
"""
#!/usr/bin/env python
import multiprocessing, sys
def foo(x):
  sys.exit(1)
multiprocessing.Pool(1).apply(foo, [1])
"""
The child will exit and the parent will hang forever.  A similar hang occurs
if one presses C-c while a child process is running (this special case is
noted in http://bugs.python.org/issue8296) or if a child is killed by a signal.

Attached is a patch to handle unexpected terminations of child
processes and prevent the parent process from hanging.  A test case is 
included.  (Developed and tested on 64-bit Ubuntu.)  Please let me know what 
you think.  Thanks!

--
components: Library (Lib)
files: termination.patch
keywords: patch
messages: 109585
nosy: gdb
priority: normal
severity: normal
status: open
title: Parent process hanging in multiprocessing if children terminate 
unexpectedly
type: behavior
versions: Python 2.6, Python 2.7
Added file: http://bugs.python.org/file17905/termination.patch




[issue9207] multiprocessing occasionally spits out exception during shutdown

2010-07-08 Thread Greg Brockman

New submission from Greg Brockman :

On Ubuntu 10.04, using freshly-compiled python-from-trunk (as well as 
multiprocessing-from-trunk), I get tracebacks from the following about 30% of 
the time:

"""
import multiprocessing, time
def foo(x):
    time.sleep(3)
multiprocessing.Pool(1).apply(foo, [1])
"""

My tracebacks are of the form:
"""
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 530, in __bootstrap_inner
  File "/usr/local/lib/python2.7/threading.py", line 483, in run
  File "/usr/local/lib/python2.7/multiprocessing/pool.py", line 272, in 
_handle_workers
: 'NoneType' object is not callable
"""

This information was originally posted to http://bugs.python.org/issue4106.

--
components: Library (Lib)
messages: 109588
nosy: gdb
priority: normal
severity: normal
status: open
title: multiprocessing occasionally spits out exception during shutdown
type: behavior
versions: Python 2.6, Python 2.7




[issue4106] multiprocessing occasionally spits out exception during shutdown

2010-07-08 Thread Greg Brockman

Greg Brockman  added the comment:

Sure thing.  See http://bugs.python.org/issue9207.

--




[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)

2010-07-08 Thread Greg Brockman

Greg Brockman  added the comment:

That's likely a mistake on my part.  I'm not observing this using the stock 
version of multiprocessing on my Ubuntu machine (after running O(100) times).  I 
do, however, observe it when using either python2.7 or python2.6 with 
multiprocessing-from-trunk, if that's interesting.

I'm not really sure what the convention is here; should this be filed just 
under Python 2.7?  Thanks.

--




[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)

2010-07-08 Thread Greg Brockman

Greg Brockman  added the comment:

No, I'm not using the Google code backport.

To be clear, I've tried testing this with two versions of multiprocessing:
- multiprocessing-from-trunk (r82645): I get these exceptions with ~40% 
frequency
- multiprocessing from Ubuntu 10.04 (version 0.70a1): No such exceptions 
observed

Out of curiosity, I did just try this with the processing library (version 
0.52) on a 64-bit Debian Lenny box, and did not observe these exceptions.

Hope that's useful!

--




[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)

2010-07-08 Thread Greg Brockman

Greg Brockman  added the comment:

> Wait - so, you are pulling svn trunk, compiling and running your test 
> with the built python executable?
Yes.  I initially observed this issue while using 10.04's Python (2.6.5), but 
wanted to make sure it wasn't fixed by using a newer interpreter.

> I'm not following the "multiprocessing-from-trunk" distinction unless 
> you're picking the module out of the tree / compiling it and then 
> moving it into some other install. I might be being overly dense.
Initially I was doing exactly that.  (Some context: I was working on a patch to 
fix a different multiprocessing issue, and figured I may as well write my patch 
against the most recent version of the library.)  Note that I was using Lucid's 
_multiprocessing, so there was no compilation involved.

> You're running your test with cd src/tree/ && ./python - right?
What... is src/tree?  If that's what you're asking, I am running the 
freshly-compiled python interpreter, and it does seem to be using the relevant 
modules out of trunk:
>>> import threading; threading.__file__
'/usr/local/lib/python2.7/threading.pyc'
>>> import multiprocessing; multiprocessing.__file__
'/usr/local/lib/python2.7/multiprocessing/__init__.pyc'
>>> import _multiprocessing; _multiprocessing.__file__
'/usr/local/lib/python2.7/lib-dynload/_multiprocessing.so'

When running with 2.6, all modules are whatever's available for 10.04 except 
for the multiprocessing that I took from trunk:
>>> import threading; threading.__file__
'/usr/lib/python2.6/threading.pyc'
>>> import multiprocessing; multiprocessing.__file__
'multiprocessing/__init__.pyc'
>>> import _multiprocessing; _multiprocessing.__file__
'/usr/lib/python2.6/lib-dynload/_multiprocessing.so'

> Also, what, if any, compile flags are you passing to the python build?
I just ran ./configure && make && make install

Sorry about the confusion--let me know if you'd like additional information.  I 
can test on other platforms/with other configurations if it would be useful.

--




[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)

2010-07-08 Thread Greg Brockman

Greg Brockman  added the comment:

Yeah, I've just taken a checkout from trunk, ran './configure && make && make 
install', and reproduced on:

- Ubuntu 10.04 32-bit
- Ubuntu 9.04 32-bit

--




[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)

2010-07-08 Thread Greg Brockman

Greg Brockman  added the comment:

With the line commented out, I no longer see any exceptions.

Although, if I understand what's going on, there is still a (much rarer) 
possibility of an exception, right?  I guess in the common case, the 
worker_handler is in the sleep when shutdown begins.  But if it happens to be 
in the _maintain_pool step, would you still get these exceptions?

--




[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)

2010-07-09 Thread Greg Brockman

Greg Brockman  added the comment:

Think http://www.mail-archive.com/python-l...@python.org/msg282114.html is 
relevant?

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-10 Thread Greg Brockman

Greg Brockman  added the comment:

Cool, thanks.  I'll note that with this patch applied, using the test program 
from 9207 I consistently get the following exception:
"""
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
  File "/usr/lib/python2.6/threading.py", line 484, in run
  File "/home/gdb/repositories/multiprocessing/pool.py", line 312, in 
_handle_workers
  File "/home/gdb/repositories/multiprocessing/pool.py", line 190, in 
_maintain_pool
  File "/home/gdb/repositories/multiprocessing/pool.py", line 158, in 
_join_exited_workers
: 'NoneType' object is not callable
"""

This is line 148 in the unpatched source, namely the 
'reversed(range(len(self._pool)))' line of _join_exited_workers.  Looks like 
the same issue, except here reversed/range/len have been set to None.

So I think by changing how much time the worker_handler spends in various 
functions, I've made it possible (or just more likely?) that if we lose the 
race with interpreter shutdown the worker_handler will be in the middle of 
_join_exited_workers.  This may mean that someone should keep around a local 
reference to reversed/range/len... not sure if there's a better solution.
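
For illustration, the "keep a local reference" idea might look something like
this (a sketch based on the 2.7 _join_exited_workers body, not a proposed
patch; binding the builtins as default arguments keeps them alive as locals
even after module globals are cleared at shutdown):
"""
def _join_exited_workers(self, _reversed=reversed, _range=range, _len=len):
    # Would live on multiprocessing.pool.Pool; the default arguments pin
    # reversed/range/len so interpreter shutdown cannot replace them with None.
    cleaned = False
    for i in _reversed(_range(_len(self._pool))):
        worker = self._pool[i]
        if worker.exitcode is not None:
            worker.join()
            cleaned = True
            del self._pool[i]
    return cleaned
"""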

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-10 Thread Greg Brockman

Greg Brockman  added the comment:

What about just catching the exception?  See e.g. the attached patch.  
(Disclaimer: not heavily tested).
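
Roughly, the idea has the following shape (a sketch, not the attached
shutdown.patch):
"""
def _make_shutdown_safe(func, *args, **kwds):
    # Run one step of a pool helper loop, swallowing the spurious errors that
    # appear when the thread loses the race with interpreter shutdown and
    # module globals (reversed, range, len, ...) have been replaced with None.
    try:
        return func(*args, **kwds)
    except Exception:
        # Possibly this should be narrowed (e.g. to TypeError only); see the
        # discussion later in the thread.
        return None
"""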

--
Added file: http://bugs.python.org/file17934/shutdown.patch




[issue9207] multiprocessing occasionally spits out exception during shutdown (_handle_workers)

2010-07-12 Thread Greg Brockman

Greg Brockman  added the comment:

With pool.py:272 commented out, running about 50k iterations, I saw 4 
tracebacks giving an exception on pool.py:152.  So this seems to imply the race 
does exist (i.e. that the thread is in _maintain_pool rather than time.sleep 
when shutdown begins).  It looks like the _maintain_pool run takes O(10^-4)s, 
so it's not surprising the error is so rare.

That being said, the patch I submitted in issue 9205 should handle this case as 
well.

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-12 Thread Greg Brockman

Greg Brockman  added the comment:

Thanks much for taking a look at this!

> why are you terminating the second pass after finding a failed 
> process?
Unfortunately, if you've lost a worker, you are no longer guaranteed that cache 
will eventually be empty.  In particular, you may have lost a task, which could 
result in an ApplyResult waiting forever for a _set call.

More generally, my chief assumption that went into this is that the unexpected 
death of a worker process is unrecoverable.  It would be nice to have a better 
workaround than just aborting everything, but I couldn't see a way to do that.

> Unpickleable errors and other errors occurring in the worker body are
> not exceptional cases, at least not now that the pool is supervised
> by _handle_workers.
I could be wrong, but that's not what my experiments were indicating.  In 
particular, if an unpickleable error occurs, then a task has been lost, which 
means that the relevant map, apply, etc. will wait forever for completion of 
the lost task.

> I think the result should be set also in this case, so the user can
> inspect the exception after the fact.
That does sound useful.  Although, how can you determine the job (and the value 
of i) if it's an unpickleable error?  It would be nice to be able to retrieve 
job/i without having to unpickle the rest.

> For shutdown.patch, I thought this only happened in the worker 
> handler, but you've enabled this for the result handler too? I don't 
> care about the worker handler, but with the result handler I'm 
> worried that I don't know what ignoring these exceptions actually 
> means.
You have a good point.  I didn't think about the patch very hard.  I've only 
seen these exceptions from the worker handler, but AFAICT there's no guarantee 
that bad luck with the scheduler wouldn't result in the same problem in the 
result handler.  One option would be to narrow the breadth of the exceptions 
caught by _make_shutdown_safe (do we need to catch anything but TypeErrors?).  
Another option would be to enable only for the worker handler.  I don't have a 
particularly great sense of what the Right Thing to do here is.

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-12 Thread Greg Brockman

Greg Brockman  added the comment:

> For processes disappearing (if that can at all happen), we could solve
> that by storing the jobs a process has accepted (started working on),
> so if a worker process is lost, we can mark them as failed too.
Sure, this would be reasonable behavior.  I had considered it but decided it was 
a larger change than I wanted to make without consulting the devs.

> I was already working on this issue last week actually, and I managed
> to do that in a way that works well enough (at least for me):
If I'm reading this right, you catch the exception upon pickling the result (at 
which point you have the job/i information already; totally reasonable).  I'm 
worried about the case of unpickling the task failing.  (Namely, the "task = 
get()" line of the "worker" method.)  Try running the following:
"""
#!/usr/bin/env python
import multiprocessing
p = multiprocessing.Pool(1)
def foo(x):
  pass
p.apply(foo, [1])
"""
And if "task = get()" fails, then the worker doesn't know what the relevant 
job/i values are.
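
To spell out where that failure happens, here is a simplified mirror of the
worker loop (illustrative only; the real loop lives in multiprocessing/pool.py):
"""
def worker_loop(get, put):
    while True:
        try:
            task = get()          # the task is unpickled inside this call
        except (EOFError, IOError):
            break
        except Exception:
            # The task failed to unpickle, so (job, i) were never recovered and
            # there is no ApplyResult to notify; the parent waits forever
            # unless something else (e.g. termination.patch) notices.
            continue
        if task is None:
            break
        job, i, func, args, kwds = task
        try:
            result = (True, func(*args, **kwds))
        except Exception as exc:
            result = (False, exc)
        put((job, i, result))
"""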

Anyway, so I guess the question that is forming in my mind is, what sorts of 
errors do we want to handle, and how do we want to handle them?  My answer is 
I'd like to handle all possible errors with some behavior that is not "hang 
forever".  This includes handling children processes dying by signals or 
os._exit, raising unpickling errors, etc.

I believe my patch provides this functionality.  By adding the extra mechanism 
that you've written/proposed, we can improve the error handling in specific 
recoverable cases (which probably constitute the vast majority of real-world 
cases).

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-13 Thread Greg Brockman

Greg Brockman  added the comment:

> What kind of errors are you having that makes the get() call fail?
Try running the script I posted.  It will fail with an AttributeError (raised 
during unpickling) and hang.

I'll note that the particular issues that I've run into in practice are:
- OOM kill destroying my workers but leaving the parent silently waiting
- KeyboardInterrupting the workers, and then having the parent hang

This AttributeError problem is one that I discovered while generating test 
cases for the patch.

--




[issue9244] multiprocessing.pool: Worker crashes if result can't be encoded

2010-07-13 Thread Greg Brockman

Greg Brockman  added the comment:

This looks pretty reasonable to my untrained eye.  I successfully applied and 
ran the test suite.

To be clear, the errback change and the unpickleable result change are actually 
orthogonal, right?  Anyway, I'm not really familiar with the protocol here, but 
assuming that you're open to code review:

> -def apply_async(self, func, args=(), kwds={}, callback=None):
> +def apply_async(self, func, args=(), kwds={}, callback=None,
> +error_callback=None):
>  '''
>  Asynchronous equivalent of `apply()` builtin
>  '''
>  assert self._state == RUN
> -result = ApplyResult(self._cache, callback)
> +result = ApplyResult(self._cache, callback, error_callback)
>  self._taskqueue.put(([(result._job, None, func, args, kwds)], None))
>  return result
Sure.  Why not add an error_callback for map_async as well?
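
For reference, caller-side usage of the proposed keyword would look something
like this (this assumes the patch under review is applied; the stock 2.x
Pool.apply_async has no error_callback argument):
"""
import multiprocessing

def raising(x):
    raise KeyError(x)

if __name__ == '__main__':
    errors = []
    p = multiprocessing.Pool(2)
    res = p.apply_async(raising, (1,), error_callback=errors.append)
    try:
        res.get()                 # re-raises the KeyError in the parent
    except KeyError:
        pass
    p.close()
    p.join()
    assert isinstance(errors[0], KeyError)
"""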

> -def __init__(self, cache, callback):
> +def __init__(self, cache, callback, error_callback=None):
>  self._cond = threading.Condition(threading.Lock())
>  self._job = job_counter.next()
>  self._cache = cache
>  self._ready = False
>  self._callback = callback
> +self._errback = error_callback
>  cache[self._job] = self
Any reason you chose to use a different internal name (errback versus 
error_callback)?   It seems cleaner to me to be consistent about the name.

>  def sqr(x, wait=0.0):
>  time.sleep(wait)
>  return x*x
> +
>  class _TestPool(BaseTestCase):
>  def test_apply(self):
> @@ -1020,6 +1021,7 @@ class _TestPool(BaseTestCase):
>  self.assertEqual(get(), 49)
>  self.assertTimingAlmostEqual(get.elapsed, TIMEOUT1)
>  
> +
>  def test_async_timeout(self):
In general, I'm wary of nonessential whitespace changes... did you mean to 
include these?

> +scratchpad = [None]
> +def errback(exc):
> +scratchpad[0] = exc
> +
> +res = p.apply_async(raising, error_callback=errback)
> +self.assertRaises(KeyError, res.get)
> +self.assertTrue(scratchpad[0])
> +self.assertIsInstance(scratchpad[0], KeyError)
> +
> +p.close()
Using "assertTrue" seems misleading.  "assertIsNotNone" is what you really 
mean, right?  Although, I believe that's redundant, since presumably 
self.assertIsInstance(None, KeyError) will error out anyway (I haven't verified 
this).


> +def test_unpickleable_result(self):
> +from multiprocessing.pool import MaybeEncodingError
> +p = multiprocessing.Pool(2)
> +
> +# Make sure we don't lose pool processes because of encoding errors.
> +for iteration in xrange(20):
> +
> +scratchpad = [None]
> +def errback(exc):
> +scratchpad[0] = exc
> +
> +res = p.apply_async(unpickleable_result, error_callback=errback)
> +self.assertRaises(MaybeEncodingError, res.get)
> +wrapped = scratchpad[0]
> +self.assertTrue(wrapped)
Again, assertTrue is probably not what you want, and is probably redundant.
> +self.assertIsInstance(scratchpad[0], MaybeEncodingError)
Why use scratchpad[0] rather than wrapped?
> +self.assertIsNotNone(wrapped.exc)
> +self.assertIsNotNone(wrapped.value)
Under what circumstances would these be None?  (Perhaps you want wrapped.exc != 
'None'?)  The initializer for MaybeEncodingError enforces the invariant that 
exc/value are strings, right?


> +
>  class _TestPoolWorkerLifetime(BaseTestCase):
>  
>  ALLOWED_TYPES = ('processes', )
Three line breaks there seems excessive.

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-13 Thread Greg Brockman

Greg Brockman  added the comment:

While looking at your patch in issue 9244, I realized that my code fails to 
handle an unpickleable task, as in:
"""
#!/usr/bin/env python
import multiprocessing
foo = lambda x: x
p = multiprocessing.Pool(1)
p.apply(foo, [1])
"""
This should be fixed by the attached pickling_error.patch (independent of my 
other patches).
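
I won't reproduce the patch here, but the general shape of such a fix is
presumably to catch the pickling failure at the point where the task handler
enqueues the task, and to report it to the waiting result object rather than
dropping it.  A sketch only (pickling_error.patch may well differ):
"""
def _safe_put(put, task, cache):
    # Task tuples in 2.7 look like (job, i, func, args, kwds).
    job, i = task[0], task[1]
    try:
        put(task)                       # pickling happens here and can raise
    except Exception as exc:            # e.g. PicklingError for a lambda
        result = cache.get(job)
        if result is not None:
            result._set(i, (False, exc))
"""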

--
Added file: http://bugs.python.org/file17987/pickling_error.patch




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-14 Thread Greg Brockman

Greg Brockman  added the comment:

Before I forget, looks like we also need to deal with the result from a worker 
failing to unpickle in the parent:
"""
#!/usr/bin/env python
import multiprocessing
def foo(x):
  global bar
  def bar(x):
    pass
  return bar
p = multiprocessing.Pool(1)
p.apply(foo, [1])
"""

This shouldn't require much more work, but I'll hold off on submitting a patch 
until we have a better idea of where we're going in this arena.

> Instead of restarting crashed worker processes it will simply bring down
> the pool, right?
Yep.  Again, as things stand, once you've lost a worker, you've lost a task, 
and you can't really do much about it.  I guess that depends on your 
application though... is your use-case such that you can lose a task without it 
mattering?  If tasks are idempotent, one could have the task handler resubmit 
them, etc..  But really, thinking about the failure modes I've seen (OOM 
kills/user-initiated interrupt) I'm not sure under what circumstances I'd like 
the pool to try to recover.

The idea of recording the mapping of tasks -> workers seems interesting.  
Getting all of the corner cases could be hard (e.g. atomically removing a task
from the queue and recording which worker removed it, or detecting whether a
worker crashed while still holding the queue lock), and doing this would require 
extra mechanism.  This feature does seem to be useful for pools running many 
different jobs, because that way a crashed worker need only terminate one job.

Anyway, I'd be curious to know more about the kinds of crashes you've 
encountered from which you'd like to be able to recover.  Is it just 
Unpickleable exceptions, or are there others?

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-15 Thread Greg Brockman

Greg Brockman  added the comment:

>> Before I forget, looks like we also need to deal with the
>> result from a worker being un-unpickleable:
>This is what my patch in bug 9244 does...
Really?  I could be misremembering, but I believe you deal with the case of the
result failing to pickle, i.e. the put(result) failing in the worker, but not
the get() in the result handler failing to unpickle.  Does my sample program
work with your patch 
applied?

> while state != TERMINATE:
>  result = get(timeout=1)
>  if all_processes_dead():
>  break;
Will this sort of approach work with the supervisor, which continually respawns 
workers?

> user-initiated interrupts, this is very important to recover from,
> think of some badly written library code suddenly raising SystemExit,
> this shouldn't terminate other jobs, and it's probably easy to 
> recover from, so why shouldn't it try?
To be clear, in this case I was thinking of KeyboardInterrupts.

I'll take a look at your patch in a bit.  From our differing use-cases, I do 
think it could make sense as a configuration option, but where it probably 
belongs is on the wait() call of ApplyResult.

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-15 Thread Greg Brockman

Greg Brockman  added the comment:

Actually, the program you demonstrate is nonequivalent to the one I posted.  
The one I posted pickles just fine because 'bar' is a global name, but doesn't 
unpickle because it doesn't exist in the parent's namespace.  (See 
http://docs.python.org/library/pickle.html#what-can-be-pickled-and-unpickled.)  
Although, if you're able to run my test program verbatim, then it's entirely 
possible I'm just missing something.

Anyway, I do think that adding a 'worker_missing_callback' could work.  You'd 
still have to make sure the ApplyResult (or MapResult) can crash the pool if it 
deems necessary though.

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-15 Thread Greg Brockman

Greg Brockman  added the comment:

Started looking at your patch.  It seems to behave reasonably, although it 
still doesn't catch all of the failure cases.  In particular, as you note, 
crashed jobs won't be noticed until the pool shuts down... but if you make a 
blocking call such as in the following program, you'll get a hang:
"""
#!/usr/bin/env python
import multiprocessing, os, signal
def foo(x):
  os.kill(os.getpid(), signal.SIGKILL)
multiprocessing.Pool(1).apply(foo, [1])
"""

The tests also occasionally hang in e.g.
test_job_killed_by_signal (__main__.WithProcessesTestPoolSupervisor) ...

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-20 Thread Greg Brockman

Greg Brockman  added the comment:

At first glance, it looks like there are a number of sites where you don't
change the blocking calls to non-blocking calls (e.g. get()).  Almost all of
the get()s have the potential to be called at a point where there is no
possibility of them ever returning.

I might recommend referring to my original termination.patch... I believe I 
tracked down the majority of such blocking calls.

In the interest of simplicity though, I'm beginning to think that the right 
answer might be to just do something like termination.patch but to 
conditionalize crashing the pool on a pool configuration option.  That way the 
behavior would be no worse for your use case.  Does that sound reasonable?
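
To sketch what I mean by a configuration option (the keyword name below is
invented, it is not in any attached patch, and it glosses over the interthread
locking issues discussed earlier):
"""
import multiprocessing.pool

class AbortingPool(multiprocessing.pool.Pool):
    # Hypothetical opt-in wrapper, not part of any attached patch.
    def __init__(self, processes=None, abort_on_worker_death=False, **kwds):
        self._abort_on_worker_death = abort_on_worker_death
        super(AbortingPool, self).__init__(processes, **kwds)

    def _maintain_pool(self):
        if self._join_exited_workers() and self._abort_on_worker_death:
            # A worker went away; error out every pending result so that
            # apply()/map() callers stop waiting instead of hanging forever.
            for job in list(self._cache):
                self._cache[job]._set(0, (False, RuntimeError('worker died')))
        else:
            self._repopulate_pool()
"""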

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-21 Thread Greg Brockman

Greg Brockman  added the comment:

> I thought the EOF errors would take care of that, at least this has
> been running in production on many platforms without that happening.
There are a lot of corner cases here, some more pedantic than others.  For 
example, suppose a child dies while holding the queue read lock... that 
wouldn't trigger an EOF error anywhere.  Would a child being OOM-killed raise 
an EOF error?  (It very well could, but I seem to recall that it does not.)

I've said most of this before, but I still believe it's relevant, so here goes. 
 In the context where I'm using this library, I'll often run jobs that should 
complete in O(10 minutes).  I'll often start a job, realize I did something 
wrong and hit C-c (which could catch the workers anywhere).  I've seen workers 
be OOM killed, silently dropping the tasks they had.  As we've established, at 
the moment any of these failures results in a hang; I'd be very happy to see 
any sort of patch that improves my chances of seeing the program terminate in a 
finite amount of time.  (And I'd be happiest if this is guaranteed.)

It's possible that my use case isn't supported... but I just want to make sure 
I've made clear how I'm using the library.  Does that make sense?

> How would you shut down the pool then?
A potential implementation is in termination.patch.  Basically, try to shut 
down gracefully, but if you timeout, just give up and kill everything.
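
Concretely, the pattern is roughly the following (illustrative only, not
termination.patch itself; note that it pokes at the pool's internal _pool list):
"""
import time

def shutdown_pool(pool, timeout=10.0):
    pool.close()                          # no new tasks; workers exit when done
    deadline = time.time() + timeout
    for p in pool._pool:                  # worker Process objects (internal)
        p.join(max(0.0, deadline - time.time()))
    if any(p.is_alive() for p in pool._pool):
        pool.terminate()                  # give up and kill everything
    pool.join()
"""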

> And why is that simpler?
It's a lot less code (one could write an even shorter patch that doesn't try to 
do any additional graceful error handling), doesn't add a new monitor thread, 
doesn't add any more IPC mechanism, etc..  FWIW, I don't see any of these 
changes as bad, but I don't feel like I have a sense of how generally useful 
they would be.

--




[issue9334] argparse does not accept options taking arguments beginning with dash (regression from optparse)

2010-07-22 Thread Greg Brockman

Changes by Greg Brockman :


--
nosy: +gdb




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-27 Thread Greg Brockman

Greg Brockman  added the comment:

> You can't have a sensible default timeout, because the worker may be
> processing something important...
In my case, the jobs are either functional or idempotent anyway, so aborting 
halfway through isn't a problem.  In general though, I'm not sure what kinds of 
use cases would tolerate silently-dropped jobs.  And for example, if an OOM 
kill has just occurred, then you're already in a state where a job was 
unexpectedly terminated... you wouldn't be violating any more contracts by 
aborting.

In general, I can't help but feel that the approach of "ignore errors and keep 
going" leads to rather unexpected bugs (and in this case, it leads to infinite 
hangs).  But even in languages where errors are ignored by default (e.g. sh), 
there are mechanisms for turning on abort-on-error handlers (e.g. set -e).

So my response is yes, you're right that there's no great default here.  
However, I think it'd be worth (at least) letting the user specify "if 
something goes wrong, then abort".  Keep in mind that this will only happen in 
very exceptional circumstances anyway.

> Not everything can be simple.
Sure, but given the choice between a simple solution and a complex one, all 
else being equal the simple one is desirable.  And in this case, the more 
complicated mechanism seems to introduce subtle race conditions and failure 
modes.

Anyway, Jesse, it's been a while since we've heard anything from you... do you 
have thoughts on these issues?  It would probably be useful to get a fresh 
opinion :).

--




[issue9205] Parent process hanging in multiprocessing if children terminate unexpectedly

2010-07-27 Thread Greg Brockman

Greg Brockman  added the comment:

Thanks for the comment.  It's good to know what constraints we have to deal 
with.

> we can not, however, change the API.
Does this include adding optional arguments?

--




[issue9535] Pending signals are inherited by child processes

2010-08-06 Thread Greg Brockman

New submission from Greg Brockman :

Upon os.fork(), pending signals are inherited by the child process.  This can 
be demonstrated by pressing C-c in the middle of the
following program:

"""
import os, sys, time, threading
def do_fork():
    while True:
        if not os.fork():
            print 'hello from child'
            sys.exit(0)
        time.sleep(0.5)
t = threading.Thread(target=do_fork)
t.start()
t.join()
"""
Right after os.fork(), each child will raise a KeyboardInterrupt exception.

This behavior is different from the semantics of POSIX fork(), where child 
processes do not inherit their parents' pending signals.

Attached is a first stab at a patch to fix this issue.  Please let me know what 
you think!

--
components: Extension Modules
files: signals.patch
keywords: patch
messages: 113104
nosy: gdb
priority: normal
severity: normal
status: open
title: Pending signals are inherited by child processes
type: behavior
versions: Python 2.5, Python 2.6, Python 2.7
Added file: http://bugs.python.org/file18416/signals.patch
