New submission from Christian Schou Oxvig:

I am experiencing seemingly random stalls in my scientific simulations, which
use a multiprocessing.Pool for parallelization. It has been incredibly
difficult to come up with an example that consistently reproduces the problem;
if and when it occurs seems more or less random. The snippet below is my best
shot at something that has a good chance of hitting the problem. I know it is
unfortunate to have PyTables in the mix, but it is the only example I have
found that almost always hits the problem. I have been able to reproduce the
problem (once!) by simply removing the with-statement (and thus PyTables) from
the work function. However, doing so makes the chance of hitting the problem
almost vanish, at least in my runs. Also, judging from the output of the
script, the cause of the problem appears to lie in Python and not in PyTables.


import os
import multiprocessing as mp
import tables

_hdf_db_name = 'join_crash_test.hdf'
_lock = mp.Lock()


class File:
    """Context manager that serializes access to a PyTables file via a
    shared lock, since HDF5 does not support concurrent writers."""

    def __init__(self, *args, **kwargs):
        self._args = args
        self._kwargs = kwargs

        if len(args) > 0:
            self._filename = args[0]
        else:
            self._filename = kwargs['filename']

    def __enter__(self):
        _lock.acquire()
        self._file = tables.open_file(*self._args, **self._kwargs)
        return self._file

    def __exit__(self, exc_type, exc_value, traceback):
        self._file.close()
        _lock.release()


def work(task):
    worker_num, iteration = task

    with File(_hdf_db_name, mode='a') as h5_file:
        h5_file.create_array('/', 'a{}_{}'.format(worker_num, iteration),
                             obj=task)
    print('Worker {} finished writing to HDF table at iteration {}'.format(
        worker_num, iteration))

    return (worker_num, iteration)

iterations = 10
num_workers = 24
maxtasks = 1

if os.path.exists(_hdf_db_name):
    os.remove(_hdf_db_name)

for iteration in range(iterations):
    print('Now processing iteration: {}'.format(iteration))
    tasks = zip(range(num_workers), num_workers * [iteration])
    print('Spawning worker pool')
    # Create the Pool before entering the try-block so the finally-clause
    # never references an unbound name if Pool() itself raises.
    workers = mp.Pool(num_workers, maxtasksperchild=maxtasks)
    try:
        print('Mapping tasks')
        results = workers.map(work, tasks, chunksize=1)
    finally:
        print('Cleaning up')
        workers.close()
        print('Workers closed - joining')
        workers.join()
        print('Process terminated')


In most of my test runs, this example stalls at "Workers closed - joining" in 
one of the iterations. Hitting Ctrl-C and inspecting the stack shows that the 
main process is waiting for a single worker that never stops executing. I have 
tested the example on various combinations of the operating systems and Python 
versions listed below.
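A possible way to see what a stuck worker is actually doing, without killing
it, is to register a faulthandler traceback dump in each worker via a Pool
initializer and then send the stalled process SIGUSR1 from a shell. This is a
sketch only (Python 3, Unix-only; it uses a trivial work function rather than
the PyTables repro above):

```python
import faulthandler
import signal
import multiprocessing as mp


def init_worker():
    # On receipt of SIGUSR1, the worker dumps its current Python
    # traceback to stderr without interrupting execution, so a stalled
    # worker can be inspected with `kill -USR1 <pid>`.
    faulthandler.register(signal.SIGUSR1)


def work(x):
    return x * x


if __name__ == '__main__':
    pool = mp.Pool(4, initializer=init_worker)
    try:
        results = pool.map(work, range(8))
        print(results)
    finally:
        pool.close()
        pool.join()
```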

Ubuntu 14.04.1 LTS
Ubuntu 14.04.3 LTS
ArchLinux (updated as of December 14, 2015)

Python 2.7.10 :: Anaconda 2.2.0 (64-bit)
Python 2.7.11 :: Anaconda 2.4.0 (64-bit)
Python 2.7.11 (Arch Linux 64-bit build)
Python 3.3.5 :: Anaconda 2.1.0 (64-bit)
Python 3.4.3 :: Anaconda 2.3.0 (64-bit)
Python 3.5.0 :: Anaconda 2.4.0 (64-bit)
Python 3.5.1 (Arch Linux 64-bit build)
Python 3.5.1 :: Anaconda 2.4.0 (64-bit)

It seems that some combinations are more likely to reproduce the problem than 
others. In particular, all the Python 3 builds reproduce the problem on almost 
every run, whereas I have not been able to reproduce the problem with the above 
example on any version of Python 2. I have, however, seen what appears to be 
the same problem in one of my simulations using Python 2.7.11. After 5 hours it 
stalled very close to the point of closing a Pool. Inspecting the HDF database 
holding the results showed that all but one of the 4000 tasks submitted to the 
Pool had finished. To me, this suggests that a single worker never finished 
executing.
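As a mitigation in long-running jobs, one can replace the blocking map/join
with map_async and a timeout, so a stalled worker raises
multiprocessing.TimeoutError instead of hanging the main process forever. A
sketch (trivial work function, hypothetical 60-second budget):

```python
import multiprocessing as mp


def work(task):
    return task * 2


if __name__ == '__main__':
    pool = mp.Pool(4)
    try:
        async_result = pool.map_async(work, range(8), chunksize=1)
        # get() with a timeout raises multiprocessing.TimeoutError
        # instead of blocking forever if a worker stalls.
        results = async_result.get(timeout=60)
        print(results)
    except mp.TimeoutError:
        pool.terminate()  # forcibly kill stalled workers
        raise
    else:
        pool.close()
    finally:
        pool.join()
```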

The problem I am describing here may well be related to issue9205 as well as 
issue22393. However, I am not sure how to verify whether this is indeed the 
case.

----------
messages: 256684
nosy: chroxvi
priority: normal
severity: normal
status: open
title: Worker stall in multiprocessing.Pool
type: behavior
versions: Python 2.7, Python 3.3, Python 3.4, Python 3.5

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25906>
_______________________________________