Just an update: yes, binding to all is the same as binding to none. My memory had misled me :)
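For anyone reading this later: adding --report-bindings to the mpirun line prints what each rank was actually bound to, which is an easy way to double-check this kind of thing. Also, since the TxPxN notation came up a few times below, here is a stripped-down sketch of what I mean by it - T plain Java threads inside each MPI rank, with P and N controlled entirely by the mpirun mapping options. The class name, thread count, and launch line are placeholders for illustration, not the actual clustering code:

    // TxPxNSketch.java - simplified illustration only, not the real clustering code.
    // Hypothetical launch line for P=2 (one rank per socket) on N nodes, as Ralph suggested:
    //   mpirun -np <2*N> --map-by socket --report-bindings java TxPxNSketch

    import mpi.MPI;
    import mpi.MPIException;

    public class TxPxNSketch {
        public static void main(String[] args) throws MPIException, InterruptedException {
            MPI.Init(args);                          // start the Open MPI Java bindings
            int rank = MPI.COMM_WORLD.getRank();     // this process's rank
            int size = MPI.COMM_WORLD.getSize();     // total number of processes = P x N

            final int t = 4;                         // T, threads per process (placeholder value)
            Thread[] workers = new Thread[t];
            for (int i = 0; i < t; i++) {
                final int tid = i;
                workers[i] = new Thread(() -> {
                    // compute-only work lives here; MPI calls stay on the main thread
                    System.out.println("rank " + rank + "/" + size + " thread " + tid);
                });
                workers[i].start();
            }
            for (Thread w : workers) {
                w.join();                            // wait for all T threads to finish
            }

            MPI.Finalize();
        }
    }

With --map-by socket each of these processes gets one socket, so its T threads stay within a single NUMA domain, which I believe matches what option A was effectively doing with PE=4.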
On Fri, Apr 11, 2014 at 1:22 AM, Saliya Ekanayake <esal...@gmail.com> wrote:

> Thank you, Ralph, for the details; you make a good point about mapping by
> node vs. socket. We have another program that uses a chain of send/receives,
> which will benefit from having consecutive ranks nearby.
>
> I have a question about bind-to none being equal to bind-to all. I understand
> that the two concepts mean the same thing, but I remember seeing poor
> performance when bind-to none was given explicitly. I need to check the
> options I used and will let you know.
>
> Yes, this test was mainly to understand how the different patterns perform,
> and it seems P=1 is not suitable for this hardware configuration, and maybe
> not in general, as you've mentioned.
>
> Thank you,
> Saliya
>
>
> On Fri, Apr 11, 2014 at 12:30 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Interesting data. A couple of quick points that might help:
>>
>> Option B is equivalent to --map-by node --bind-to none. When you bind to
>> every core on the node, we don't bind you at all, since "bind to all" is
>> exactly equivalent to "bind to none". So it will definitely run slower, as
>> the threads run across the different NUMA regions on the node.
>>
>> You might also want to try --map-by socket with no binding directive.
>> This would map one process to each socket, binding it to the socket - which
>> is similar to what your option A actually accomplished. The only difference
>> is that the procs that share a node will differ in rank by 1, whereas
>> option A would have those procs differ in rank by N. Depending on your
>> communication pattern, this could make a big difference.
>>
>> Map-by socket typically gives the fastest performance for threaded apps. You
>> generally don't want P=1 unless you have a *lot* of threads in the process,
>> as it removes any use of shared memory, and so messaging will run slower -
>> and you want the ranks that share a node to be the ones that most
>> frequently communicate with each other, if you can identify them.
>>
>> HTH
>> Ralph
>>
>> On Apr 10, 2014, at 5:59 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>
>> Hi,
>>
>> I am evaluating the performance of a clustering program written in Java
>> with MPI+threads and would like to get some insight into solving a peculiar
>> case. I've attached a performance graph to explain this.
>>
>> In essence, the tests were carried out as TxPxN, where T is threads per
>> process, P is processes per node, and N is the number of nodes. I noticed an
>> inefficiency with the Tx*1*xN cases in general (tall bars in the graph).
>>
>> To elaborate a bit further:
>> 1. each node has 2 sockets with 4 cores each (totaling 8 cores)
>> 2. used Open MPI 1.7.5rc5 (later tested with 1.8 and observed the same)
>> 3. with options
>>    A.) --map-by node:PE=4 and --bind-to core
>>    B.) --map-by node:PE=8 and --bind-to core
>>    C.) --map-by socket and --bind-to none
>>
>> The timings of A, B, C came out as A < B < C, so I used the results from
>> option A for the Tx*1*xN cases in the graph.
>>
>> Could you please give some suggestions that may help speed up these
>> Tx*1*xN cases? Also, I expected B to perform better than A, as the threads
>> could utilize all 8 cores, but that wasn't the case.
>>
>> Thank you,
>> Saliya
>>
>> <image.png>

--
Saliya Ekanayake
esal...@gmail.com
Cell 812-391-4914 Home 812-961-6383
http://saliya.org