> I tend to ramble, and I am afraid none of you busy experts will
> bother reading my long post
I think that's a fairly accurate description, and prediction.

> I am hoping people with experience using any of these would chime in
> with tips. The main thing I would look for in a toolkit is maturity
> and no extra dependencies. Plus a wide user community is always good.
> POSH/parallelpython/mpi4py/pyPar/Kamaelia/Twisted I am so confused :(

After reading your problem description, I think you need none of these.
Parallelization-wise, it seems to be a fairly straightforward task, with
coarse-grained parallelism (which is the easier kind).

> 2. The processing involves multiple steps that each input file has to
> go through. I am trying to decide between a batch mode design and a
> pipelined design for concurrency.

Try to avoid synchronization as much as you can if you want a good
speed-up: have long-running tasks that don't need to interact with each
other, rather than frequent synchronization. If you haven't read about
Amdahl's law, do so now - roughly, if a fraction s of the total work is
serial, no number of CPUs can ever give you a speed-up better than 1/s.

From your description, it seems that having a single process handle an
entire input file, from beginning to end, is the right approach; use
the multiple CPUs to process different input files in parallel (IIUC,
they can be processed in any order, and simultaneously).

> In the batched design, all files will be processed on one processing
> step (in parallel) before the next step is started.

Why? If you have N "worker" processes, each one should proceed at its
own rate, i.e. each one does all the steps for a single file in
sequence; there is no need to wait until all processes have completed
one step before any of them starts the next one.

As for N: don't make it the number of input files. Instead, make it the
number of CPUs, or perhaps twice the number of CPUs. The maximum
speed-up you can get out of k CPUs is k, so it is pointless (and
memory-consuming) to have many more than k worker processes.

> In a pipelined design, each file will be taken through all steps to
> the end.

Perhaps this is what I suggested above - I would not call it
"pipelined", though, because in pipelining I would expect the separate
steps to run in parallel, each one passing its output to the next step
in the pipeline. That allows for a maximum speedup equal to the number
of pipeline steps. IIUC, you have many more input documents than
pipeline steps, so parallelizing by input data allows for higher
speedups.

> The subprocess module should allow spawning new processes, but I am
> not sure how to get status/error codes back from those?

What's wrong with the returncode attribute of subprocess.Popen objects?

> I guess I can forget about communicating with those?

Assuming all you want to know is whether they succeeded or failed:
yes. Just look at the exit status.

> I guess I would need some sort of queuing system here, to submit
> files to the CPUs properly?

No. You don't submit files to CPUs, you submit them to processes (if
you "submit" anything at all); the operating system will choose a CPU
for you. If you follow the architecture I proposed above (one Python
process processes a file from beginning to end), you pass the file to
be processed on the command line. If the processing is fairly short
(which I assume it is not), you could also have a single process handle
multiple input files, one after another (e.g. process 1 does files
1..400, process 2 does files 401..800, and so on).

> The other way could be to have each individual file run through all
> the steps and have multiple such "pipelines" running simultaneously
> in parallel.

Ok. See above - I wouldn't call it pipelined, but this is what you
should do.
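
To make that concrete, here is a minimal sketch of the driver side,
under some assumptions: the worker script name (process_file.py), the
input file pattern, and NUM_WORKERS are placeholders for your actual
setup. It keeps N workers busy, passes each file name on the command
line, and collects the exit statuses:

    # Sketch of a driver; process_file.py, the glob pattern and the
    # worker count are placeholders, not part of any real library.
    import glob
    import subprocess
    import sys
    import time

    NUM_WORKERS = 4                  # number of CPUs, or twice that
    files = glob.glob("input/*.gz")  # placeholder for your 800 files

    running = {}                     # Popen object -> file name
    failed = []

    while files or running:
        # top up the pool: keep NUM_WORKERS workers busy
        while files and len(running) < NUM_WORKERS:
            name = files.pop()
            proc = subprocess.Popen(
                [sys.executable, "process_file.py", name])
            running[proc] = name
        # reap finished workers; poll() sets returncode on exit
        for proc in list(running):
            if proc.poll() is not None:
                if proc.returncode != 0:
                    failed.append(running[proc])
                del running[proc]
        time.sleep(0.1)              # don't spin while workers run

    print("failed:", failed)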
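And a sketch of the worker side - the hypothetical process_file.py
above. The gzip format and the three step functions are stand-ins for
whatever your real steps are; the point is that one file goes through
all steps in memory, and success or failure is reported through the
exit status alone:

    # Sketch of the hypothetical process_file.py worker; the steps
    # are placeholders for your actual processing.
    import gzip
    import sys

    def decompress(path):
        # read and decompress the whole file in one go; stays in memory
        f = gzip.open(path, "rb")
        try:
            return f.read()
        finally:
            f.close()

    def parse(data):
        return data.split()       # placeholder for your parsing step

    def analyze(records):
        return len(records)       # placeholder for your analysis step

    def main():
        path = sys.argv[1]         # file name passed by the driver
        data = decompress(path)    # step 1
        records = parse(data)      # step 2
        result = analyze(records)  # step 3
        # only the final result touches the disk, not the intermediates
        out = open(path + ".result", "w")
        out.write("%s\n" % result)
        out.close()

    if __name__ == "__main__":
        try:
            main()
        except Exception:
            sys.exit(1)            # driver sees a nonzero returncode
        sys.exit(0)

Notice that the processes never talk to each other; the only
"synchronization" is waiting for a child to exit, which is exactly the
coarse-grained setup that gives a good speed-up.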
> It feels like this method will lose cache performance because all the
> code for all the steps will be loaded at the same time, but I am not
> sure if I should be worrying about that.

[I assume you mean "CPU cache" here]

You should not worry about it, and I very much doubt that this actually
happens. The code for all the steps will *not* be loaded into the
cache; that it sits in main memory has no effect whatsoever on cache
performance. You can assume that all the code that runs a single step
will be in the cache (unless the algorithms are very large). OTOH, the
data of an input file will probably not fit in the CPU cache anyway.

What you should worry about is reducing disk IO: all processing of a
file should run completely in memory, as in the worker sketch above.
That is another reason to run an input file from beginning to end - if
you instead first read and decompressed all 800 files, the decompressed
output would likely not fit in the disk cache, so the system would have
to read that data back from disk again in the next step.

Regards,
Martin