On 1/29/2014 11:30 PM, Ralph Castain wrote:

On Jan 29, 2014, at 7:56 PM, Victor <victor.ma...@gmail.com> wrote:

Thanks for the insights, Tim. I was aware that the CPUs will choke beyond a certain point. From memory, on my machine this happens at 5 concurrent MPI jobs with the benchmark I am using.

My primary question was about scaling between the nodes. I was not getting close to double the performance when running MPI jobs across two 4-core nodes. It may be better now since I have Open-MX in place, but I have not repeated the benchmarks yet since I need to get one simulation job done ASAP.

Some of that may be due to the expected loss of performance when you switch from shared memory to inter-node transports. While the point about saturating the memory path is true, what you reported could be more consistent with that transition - i.e., it isn't unusual to see applications perform better when run on a single node, depending upon how they are written, up to a certain problem size (which your code may not be hitting).
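As an illustration only, a two-node run of the kind described above might look something like the following, assuming an Open MPI build that picks up the Open-MX (MX) BTL; the hostfile name, slot counts, rank count, and benchmark binary are all placeholders:

  # hostfile listing the two 4-core nodes, e.g.
  #   node01 slots=4
  #   node02 slots=4
  # The explicit BTL list steers Open MPI onto MX (Open-MX) for
  # inter-node traffic, shared memory within a node, and self.
  mpirun --hostfile hosts -np 8 --mca btl mx,sm,self ./benchmark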


Regarding your mention of setting affinities and MPI ranks, do you have specific (as in syntactically specific, since I am a novice and easily confused...) examples of how I might set affinities to get the Westmere node performing better?

mpirun --bind-to-core --cpus-per-rank 2 ...

will bind each MPI rank to 2 cores. Note that this is definitely *not* a good idea if you are running more than two threads per process - if you are, set --cpus-per-rank to the number of threads, keeping in mind that you want things to break evenly across the sockets. In other words, if you have two 6-core Westmere sockets on the node, then you want one of the following:

- 6 processes with --cpus-per-rank 2 if each process runs 2 threads,
- 4 processes with --cpus-per-rank 3 if each process runs 3 threads, or
- 2 processes with --bind-to-socket instead of --bind-to-core (and no --cpus-per-rank) for any other thread count > 3.

You would not want to run any other number of processes on the node or else the binding pattern will cause a single process to split its threads across the sockets - which will definitely hurt performance.
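To make those layouts concrete, the commands might look something like the following on a 2 x 6-core Westmere node; the option spellings follow the Open MPI 1.6/1.7-era syntax used in this thread (newer releases express the same thing with --map-by and --bind-to), the executable name is a placeholder, and --report-bindings is only there to confirm where each rank actually landed:

  # 6 ranks x 2 threads each, 2 cores per rank (12 cores total)
  mpirun -np 6 --bind-to-core --cpus-per-rank 2 --report-bindings ./app

  # 4 ranks x 3 threads each, 3 cores per rank (2 ranks per socket)
  mpirun -np 4 --bind-to-core --cpus-per-rank 3 --report-bindings ./app

  # 2 ranks, one whole socket each, for any other thread count
  mpirun -np 2 --bysocket --bind-to-socket --report-bindings ./app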


--cpus-per-rank 2 is an effective choice for this platform. As Ralph said, it should work automatically for 2 threads per rank. Ralph's point about not splitting a process across sockets is an important one. Even splitting a process across internal buses, which would happen with 3 threads per process, seems problematic.
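For completeness, one hedged way to pair that binding with a 2-thread-per-rank OpenMP code on the same 12-core node, assuming the application takes its thread count from OMP_NUM_THREADS (the executable name is again a placeholder):

  # -x exports OMP_NUM_THREADS=2 to every rank on every node
  mpirun -np 6 --bind-to-core --cpus-per-rank 2 -x OMP_NUM_THREADS=2 ./app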

--
Tim Prince
