Just an update: yes, binding to all is the same as binding to none. I was
simply misremembering :)
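
For anyone who wants to double-check this on their own system, adding
--report-bindings to the mpirun command makes Open MPI print each rank's
binding at launch. A minimal sketch, assuming two 8-core nodes and a
placeholder executable (./prog):

    # PE=8 on an 8-core node asks for every core, i.e. "bind to all" ...
    mpirun -np 2 --map-by node:PE=8 --bind-to core --report-bindings ./prog

    # ... which should report the same effective (non-)binding as this
    mpirun -np 2 --map-by node --bind-to none --report-bindings ./prog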


On Fri, Apr 11, 2014 at 1:22 AM, Saliya Ekanayake <esal...@gmail.com> wrote:

> Thank you, Ralph, for the details; that's a good point you raised about
> mapping by node vs. by socket. We have another program that uses a chain of
> send-receives, which will benefit from having consecutive ranks nearby.
>
> I have a question about bind-to none being equal to bind-to all. I
> understand the two mean the same thing, but I remember seeing poor
> performance when bind-to none was given explicitly. I need to check the
> options I used and will let you know.
>
> Yes, this test was mainly to understand how the different patterns perform,
> and it seems P=1 is not suitable for this hardware configuration, and
> perhaps not in general, as you've mentioned.
>
> Thank you,
> Saliya
>
>
> On Fri, Apr 11, 2014 at 12:30 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Interesting data. Couple of quick points that might help:
>>
>> Option B is equivalent to --map-by node --bind-to none. When you bind to
>> every core on the node, we don't bind you at all, since "bind to all" is
>> exactly equivalent to "bind to none". So it will definitely run slower, as
>> the threads run across the different NUMA regions on the node.
>>
>> You might also want to try --map-by socket, with no binding directive.
>> This would map one process to each socket, binding it to the socket - which
>> is similar to what your option A actually accomplished. The only difference
>> is that the procs that share a node will differ in rank by 1, whereas
>> option A would have those procs differ in rank by N. Depending on your
>> communication pattern, this could make a big difference.
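>>
>> A sketch of that suggestion, assuming two of the 2-socket nodes described
>> below, a hostfile that lists each node with slots=2, and a purely
>> illustrative Java driver (classpath and class name are placeholders):
>>
>>     # one process per socket, bound to its socket by default;
>>     # ranks that share a node are adjacent in rank (0 with 1, 2 with 3)
>>     mpirun -np 4 --hostfile hosts --map-by socket --report-bindings \
>>         java -cp app.jar edu.example.Driver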
>>
>> Map-by socket typically gives the fastest performance for threaded apps.
>> You generally don't want P=1 unless you have a *lot* of threads in the
>> process, as it removes any use of shared memory and so messaging will run
>> slower - and you want the ranks that share a node to be the ones that most
>> frequently communicate with each other, if you can identify them.
>>
>> HTH
>> Ralph
>>
>> On Apr 10, 2014, at 5:59 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>
>> Hi,
>>
>> I am evaluating the performance of a clustering program written in Java
>> with MPI+threads, and I would like some insight into a peculiar case. I've
>> attached a performance graph to illustrate it.
>>
>> In essence, the tests were carried out as TxPxN, where T is threads per
>> process, P is processes per node, and N is the number of nodes. I noticed
>> an inefficiency with the Tx*1*xN cases in general (the tall bars in the
>> graph).
>>
>> To elaborate a bit further,
>> 1. each node has 2 sockets with 4 cores each (totaling 8 cores)
>> 2. used OpenMPI 1.7.5rc5 (later tested with 1.8 and observed the same)
>> 3. with these options (rough command sketches follow the list)
>>      A.) --map-by node:PE=4 and --bind-to core
>>      B.) --map-by node:PE=8 and --bind-to core
>>      C.) --map-by socket and --bind-to none
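>>
>> For reference, a rough sketch of how the three launch lines look, assuming
>> an allocation of 8 nodes and a purely illustrative Java driver (classpath,
>> class name, and process count are placeholders):
>>
>>     # A: one proc per node, each bound to 4 cores of its node
>>     mpirun -np 8 --map-by node:PE=4 --bind-to core \
>>         java -cp app.jar edu.example.Driver
>>
>>     # B: one proc per node, PE=8 spans all 8 cores of the node
>>     mpirun -np 8 --map-by node:PE=8 --bind-to core \
>>         java -cp app.jar edu.example.Driver
>>
>>     # C: mapped by socket and explicitly left unbound
>>     mpirun -np 8 --map-by socket --bind-to none \
>>         java -cp app.jar edu.example.Driver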
>>
>> The timings of A, B, and C came out as A < B < C, so I used the results
>> from option A for Tx*1*xN in the graph.
>>
>> Could you please give some suggestions that might help speed up these
>> Tx*1*xN cases? Also, I expected B to perform better than A, since the
>> threads could utilize all 8 cores, but that wasn't the case.
>>
>> Thank you,
>> Saliya
>>
>>
>> <image.png>
>>
>> --
>> Saliya Ekanayake esal...@gmail.com
>> Cell 812-391-4914 Home 812-961-6383
>> http://saliya.org
>>
>
>
>
> --
> Saliya Ekanayake esal...@gmail.com
> Cell 812-391-4914 Home 812-961-6383
> http://saliya.org
>



-- 
Saliya Ekanayake esal...@gmail.com
Cell 812-391-4914 Home 812-961-6383
http://saliya.org
