Hi guys, I tend to ramble, and I am afraid none of you busy experts will bother reading my long post, so I will try to summarize it first:
1. I have a script that processes ~10GB of data daily and runs for a long time, and I need to parallelize it on a multi-CPU/multicore system. I am trying to decide on a module/toolkit that would help me create a multiprocessing solution, but there are so many of them that I can't decide what to use. I am looking for a cross-platform solution, although right now it has to work on Windows first, so many of the fork-based modules are out. I am hoping people with experience using any of these will chime in with tips. The main things I look for in a toolkit are maturity and no extra dependencies. A wide user community is always good too. POSH/parallelpython/mpi4py/pyPar/Kamaelia/Twisted... I am so confused :(

2. The processing involves multiple steps that each input file has to go through. I am trying to decide between a batch-mode design and a pipelined design for concurrency. In the batch design, all files are processed in one step (in parallel) before the next step is started. In the pipelined design, each file is taken through all steps to the end, so multiple files are in parallel pipelines at the same time. I can't decide which is better. I guess I am asking for experienced eyes to look at the alternatives and spot the things that I, making my very first concurrent design, won't see.

DETAILS:

I have been trying to choose a design for this project but am stricken by my usual case of analysis paralysis. I decided to learn Python about 3 weeks ago specifically for this project, as it needed parsing and text processing, not realizing that I would need concurrency. I am having the same trouble deciding which parser generator to use, but I will ask about parsing in a separate thread to keep this one focused.

The script was slow, so I tried a multithreaded version, naively expecting a 2x speedup. I barely got a 5% improvement, and only then learned about the GIL.
I guess I still haven't got too much time invested in this, so I could still switch to another language. I am not sure which other scripting languages have real multithreading. Perl? But I chose Python over Perl for readability and maintainability and am not ready to give that up yet. I know about Stackless/IronPython/Jython, but I want to stick to CPython. So I am going to try to figure this out.

Even after deciding to go for an SMP solution, I still don't know which toolkit to use. The subprocess module should allow spawning new processes, but I am not sure how to get status/error codes back from them. I guess this is why people wrote those parallel processing modules, which might help by taking care of these things. I think my application is fairly simple and should be easy to run on SMP.

THE TASK:

About 800+ files of 10-15MB each are generated daily and need to be processed. The processing consists of different steps that the files must go through:

-Uncompress
-FilterA
-FilterB
-Parse
-Possibly compress parsed files for archival

All files have to be run through each of the two filters. The two filters are independent of each other and produce output files that need separate parsers. So the filters can in fact run in parallel, and so can the subsequent parsers. Furthermore, multiple files can be running in parallel inside each step, e.g. 4 files being uncompressed at the same time.

I am using the Python library for uncompressing and will be doing the parsing in Python too. But the two filters are external console programs that I spawn in the system shell with subprocess.call(). I guess I can forget about communicating with those?

The first method that came to mind was to finish each step on all files before going to the next. So all files are uncompressed first, using multiple processes in parallel. Then all files are filtered in parallel, etc. I guess I would need some sort of queuing system here to submit files to the CPUs properly?
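A minimal sketch of one batch step, to make the two questions above concrete (getting exit codes back from spawned programs, and queuing work so only a few run at once). It assumes a modern Python 3 (3.7 or later), where concurrent.futures and subprocess.run() are in the standard library; the post itself predates both. The names run_filter, run_batch and fake_filter are hypothetical, and fake_filter is only a stand-in for a real external filter program.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_filter(command):
    """Run one external command and return (exit code, stdout, stderr)."""
    done = subprocess.run(command, capture_output=True, text=True)
    return done.returncode, done.stdout, done.stderr


def run_batch(commands, workers=4):
    """Run all commands, at most `workers` at a time.

    The executor keeps an internal queue of pending commands, and
    map() returns the results in the same order as the input.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_filter, commands))


def fake_filter(name, exit_code):
    # Hypothetical stand-in for a real filter program: it prints the
    # file name it was given and exits with the requested exit code.
    script = "import sys; print(sys.argv[1]); sys.exit(int(sys.argv[2]))"
    return [sys.executable, "-c", script, name, str(exit_code)]


if __name__ == "__main__":
    commands = [fake_filter("a.dat", 0), fake_filter("b.dat", 3)]
    for command, (code, out, err) in zip(commands, run_batch(commands)):
        print(command[3], "exit code", code)
```

Threads are enough in this sketch because each worker only waits for a child process, so the GIL is not the bottleneck. For steps that do their work in Python itself, such as the parsing, ProcessPoolExecutor offers the same map() interface and would be the likely swap; on Windows that requires the `if __name__ == "__main__":` guard shown here.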
The other way could be to have each individual file run through all the steps, with multiple such "pipelines" running in parallel. It feels like this method will lose cache performance because the code for all the steps will be loaded at the same time, but I am not sure if I should be worrying about that. It has the advantage of "Fast First-Out": something waiting for the results of processing won't have to wait till the very end, and can start receiving data incrementally from the start (kind of streaming?). Pipelined mode may also make it quick to rerun an individual file in case it had an error.

So what's the better method?

EVALUATIONS:

POSH - Doesn't seem mature; was supposed to be a proof of concept only. People have reported bugs/problems using it. POSIX only.

delegate/forkmap/pprocess - Fork based, POSIX only.

ParallelPython - Seems to meet all criteria, and is cross platform. I will be trying this one first.

remoteD - Claims to be platform independent, but I don't think so. The code shows os.fork only. Last updated 2004, v0.8.

processing - Is in beta (v0.33) but looks promising and is cross platform. Uses processes but mimics the API of the threading module. http://www.python.org/pypi/processing

MPI based modules (probably overkill for my application):

pyPar - Mature, cross platform. Has a dependency on Numeric Python and needs a C compiler.

pyMpi - POSIX only. Alpha status. From Lawrence Livermore labs. It modifies the interpreter itself to make it multi-noded.

mpi4py - ? Another MPI-based module.
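Going back to the pipelined design described before the evaluations, here is a rough sketch of what "each file runs through all steps, several files at once" could look like. It assumes a modern Python 3 with concurrent.futures in the standard library. The step functions uncompress, filter_a and parse are hypothetical stand-ins operating on in-memory data, not the real filters or parsers from this project.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor, as_completed


def uncompress(blob):
    """Step 1: gunzip the raw bytes and decode them as text."""
    return gzip.decompress(blob).decode("utf-8")


def filter_a(text):
    # Hypothetical filter step: keep only the non-empty lines.
    return [line for line in text.splitlines() if line.strip()]


def parse(lines):
    # Hypothetical parser step: turn "key=value" lines into a dict.
    return dict(line.split("=", 1) for line in lines)


def pipeline(name, blob):
    """Take one file through every step, start to finish."""
    return name, parse(filter_a(uncompress(blob)))


def run_pipelines(files, workers=4):
    """files maps a name to compressed bytes.

    Yields (name, parsed) as soon as each file's pipeline finishes,
    so a consumer can start before the whole day's batch is done.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(pipeline, name, blob)
                   for name, blob in files.items()]
        for future in as_completed(futures):
            yield future.result()


if __name__ == "__main__":
    files = {
        "a": gzip.compress(b"x=1\n\ny=2\n"),
        "b": gzip.compress(b"k=v\n"),
    }
    for name, parsed in run_pipelines(files, workers=2):
        print(name, parsed)
```

as_completed() is what gives the "Fast First-Out" behaviour: results arrive in completion order rather than submission order. future.result() re-raises any exception from that file's pipeline, so a caller that wants to rerun only the failed files would need to catch it per file.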
LINKS & DISCUSSIONS:

http://wiki.python.org/moin/ParallelProcessing
http://blog.ianbicking.org/gil-of-doom.html
http://www.usenix.org/events/hotos03/tech/full_papers/vonbehren/vonbehren_html/index.html
http://groups.google.com/group/comp.lang.python/browse_thread/thread/1f5d927d34f8f323/
http://groups.google.com/group/comp.lang.python/browse_frm/thread/332083cdc8bc44b/
http://groups.google.com/group/comp.lang.python/browse_frm/thread/13da24f2d6dc24a9/
http://groups.google.com/group/comp.lang.python/browse_thread/thread/f822ec289f30b26a/
http://groups.google.com/group/comp.lang.python/browse_thread/thread/902dbddfc31b8891
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d8fa9ad770c17c70/