Interesting data. Couple of quick points that might help:

Option B is equivalent to --map-by node --bind-to none. When you bind a 
process to every core on the node, we don't bind it at all, since "bind to 
all" is exactly equivalent to "bind to none". So it will definitely run 
slower, as the threads wander across the different NUMA regions on the node.
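
To put that in command-line terms (just a sketch - "java YourApp" and the 
proc count are placeholders for your actual launch line), these two 
invocations should behave the same on your 8-core nodes:

    # option B as given: one proc per node, "bound" to all 8 cores
    mpirun -np <nprocs> --map-by node:PE=8 --bind-to core java YourApp

    # ...which OMPI treats exactly like not binding at all
    mpirun -np <nprocs> --map-by node --bind-to none java YourApp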

You might also want to try --map-by socket, with no binding directive. This 
would map one process to each socket, binding it to the socket - which is 
similar to what your option A actually accomplished. The only difference is 
that the procs that share a node will differ in rank by 1, whereas option A 
would have those procs differ in rank by N. Depending on your communication 
pattern, this could make a big difference.
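
To make the rank-ordering point concrete, here's a sketch assuming 2 nodes 
with 2 sockets each ("java YourApp" is again just a stand-in; the layouts 
below are roughly what --report-bindings should print for you):

    # one rank per socket, each bound to its socket
    mpirun -np 4 --map-by socket --report-bindings java YourApp
    #   node0 gets ranks 0,1    node1 gets ranks 2,3

    # mapping by node instead (option A style)
    mpirun -np 4 --map-by node:PE=4 --bind-to core --report-bindings java YourApp
    #   node0 gets ranks 0,2    node1 gets ranks 1,3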

Map-by socket typically gives the best performance for threaded apps. You 
generally don't want P=1 unless you have a *lot* of threads in the process, 
as it removes any use of shared memory and so messaging will run slower. 
And, if you can identify them, you want the ranks that share a node to be 
the ones that communicate with each other most frequently.
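
For what it's worth, on-node traffic typically goes over the shared memory 
(sm) BTL - a sketch only; the BTL list below assumes a plain TCP cluster 
and may differ on your system:

    # with P >= 2 per node, on-node pairs can talk over shared memory (sm);
    # with P = 1 per node, every message has to cross the network instead
    mpirun -np <nprocs> --map-by socket --mca btl self,sm,tcp java YourApp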

HTH
Ralph

On Apr 10, 2014, at 5:59 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> Hi,
> 
> I am evaluating the performance of a clustering program written in Java with 
> MPI+threads and would like to get some insight into solving a peculiar case. 
> I've attached a performance graph to illustrate this.
> 
> In essence, the tests were carried out as TxPxN, where T is the number of 
> threads per process, P is the number of processes per node, and N is the 
> number of nodes. I noticed an inefficiency with the Tx1xN cases in general 
> (the tall bars in the graph).
> 
> To elaborate a bit further: 
> 1. each node has 2 sockets with 4 cores each (totaling 8 cores) 
> 2. I used OpenMPI 1.7.5rc5 (later tested with 1.8 and observed the same)
> 3. with these options
>      A.) --map-by node:PE=4 and --bind-to core
>      B.) --map-by node:PE=8 and --bind-to core
>      C.) --map-by socket and --bind-to none
> 
> The timings of A, B, and C came out as A < B < C, so I used the results 
> from option A for the Tx1xN cases in the graph. 
> 
> Could you please give some suggestions that may help speed up these Tx1xN 
> cases? Also, I expected B to perform better than A, as the threads could 
> utilize all 8 cores, but that wasn't the case.
> 
> Thank you,
> Saliya
> 
> 
> [attachment: image.png (the performance graph referenced above)]
> 
> -- 
> Saliya Ekanayake esal...@gmail.com 
> Cell 812-391-4914 Home 812-961-6383
> http://saliya.org
