Hello again, everyone. I'm developing some custom neural network code. I'm using Python 2.6, Numpy 1.5, and Ubuntu Linux 10.10. I have an AMD 1090T six-core CPU. About six weeks ago, I asked some questions about multiprocessing in Python, and I got some very helpful responses from you all.
http://groups.google.com/group/comp.lang.python/browse_frm/thread/374e1890efbcc87b

Now I'm back with a new question. I have gotten comfortable with cProfile and with multiprocessing's various Queues (I've graduated from Pool). I just ran some extensive tests of my newest code, and I've learned some surprising things. I have a pretty picture here (be sure to view the full-size image): http://www.flickr.com/photos/15579975@N00/5744093219

I'll ask my question first, to avoid a TL;DR problem: when you have a multi-core CPU with N cores, is it common to see performance peak at N-1, or even N-2, processes? In other words, should you avoid using quite as many processes as there are cores? I was expecting diminishing returns for each additional core, but not outright declines, yet that's what my data seem to show for many of my trial runs.

I ran this test twice. The first time, I was reading a few PDFs and web pages while the speed test was running, but even when I wasn't using the computer for those other (light) tasks, I saw the same performance drops. Perhaps this is due to OS overhead? The load average in my system monitor looks pretty quiet when I'm not running my program.

OK, if you care to read further, here's some more detail.

My graphs show the execution time of my neural-network evaluation routine as a function of:

- the size of the neural network (six sizes were tried, with varying numbers of inputs, outputs, and hidden nodes),
- the subprocess configuration (either no subprocess at all, or 1-6 subprocesses), and
- the size of the input data vector (from 7 to 896 sets of inputs; I'll explain the rationale for the exact numbers I chose if anyone cares to know).

Each graph is normalized to the execution time of running the evaluation routine on a single CPU, without invoking a subprocess. Obviously, I'm looking for the conditions that yield performance gains over that baseline. (I'll be running this particular piece of code millions of times!) I used 200 repetitions for each combination of network size, input data size, and number of CPU cores. Even so, there was substantial irregularity in the timing graphs, so rather than connecting the dots directly, which would produce messy crossing lines that are hard to read, I fit B-spline curves to the data.

As I anticipated, there is a performance penalty incurred just for parceling out the data to the multiple processes and collating the results at the end (rough sketches of the parcel-out pattern and of my timing/normalization harness appear at the end of this post). When the data set is small, it's faster to send it to a single CPU without invoking a subprocess. In fact, dividing a small task among three processes can underperform a two-process approach, and so on! See the leftmost two panels in the top row and the rightmost two panels in the bottom row. As the networks increase in complexity, the data-set size at which break-even performance is reached drops accordingly. I'm most concerned about optimizing these bigger problems, obviously, because they take the longest to run.

What I did not anticipate was finding a performance reversal with added computing power for large data sets.

Comments are appreciated!
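For concreteness, here is a minimal sketch of the kind of Queue-based parcel-out-and-collate pattern I'm timing. This is not my actual code: evaluate_chunk() is just a stand-in for the real network evaluation, and the chunking is simplified.

import multiprocessing as mp
import numpy as np

def evaluate_chunk(chunk):
    # Stand-in for the real neural-network evaluation routine.
    return chunk * 2.0

def worker(task_q, result_q):
    # Pull (index, chunk) tasks until the None sentinel arrives.
    while True:
        item = task_q.get()
        if item is None:
            break
        index, chunk = item
        result_q.put((index, evaluate_chunk(chunk)))

def evaluate_parallel(data, n_procs):
    # Parcel the input rows out to n_procs subprocesses and collate the results.
    task_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(task_q, result_q))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    chunks = np.array_split(data, n_procs)
    for i, chunk in enumerate(chunks):
        task_q.put((i, chunk))
    for _ in procs:
        task_q.put(None)          # one sentinel per worker
    results = [result_q.get() for _ in range(len(chunks))]
    for p in procs:
        p.join()
    results.sort()                # restore the original chunk order
    return np.concatenate([chunk for _, chunk in results])

Most of the fixed cost in a pattern like this is pickling the chunks and results through the Queues plus starting and joining the worker processes, which is presumably why the small-data cases lose to the plain single-CPU call.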
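And here is roughly how I time and normalize the runs, again simplified: evaluate and evaluate_parallel stand in for the single-CPU routine and a multi-process version like the one sketched in this post, and the repetition and process counts are parameters.

import time

def time_call(func, *args):
    # Wall-clock seconds for a single call of func(*args).
    start = time.time()
    func(*args)
    return time.time() - start

def relative_times(evaluate, evaluate_parallel, data, max_procs=6, reps=200):
    # Total time over `reps` single-CPU baseline runs...
    baseline = sum(time_call(evaluate, data) for _ in range(reps))
    # ...then the same number of runs for each subprocess count,
    # reported as a fraction of the baseline (< 1.0 means a speedup).
    relative = {}
    for n in range(1, max_procs + 1):
        total = sum(time_call(evaluate_parallel, data, n) for _ in range(reps))
        relative[n] = total / baseline
    return relative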