Yeah, it can really have an impact! It is unfortunately highly application-specific, so all we can do is provide the tools.
As you can see from the binding map, we are tight-packing the procs on each node to maximize the use of shared memory. However, this assumes that each rank is predominantly going to “talk” to rank +/- 1 - i.e., that the pattern involves nearest-neighbor ranks. If that isn’t true (e.g., the lowest-ranked process on one node talks to the lowest-ranked process on the next node, etc.), then this would be a bad mapping for performance. In that case, you can use the “rank-by” option to keep the location and binding the same, but change the assigned MCW ranks to align with your communication pattern.
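For example, something along these lines (an untested sketch, reusing your slot:pe=2 layout and the option spellings from this thread) keeps the same tight-packed binding but hands out MCW ranks round-robin across nodes:

  mpirun -map-by slot:pe=2 -rank-by node -report-bindings -np 32 ./my_app

Here ./my_app is just a placeholder for your executable; -report-bindings is only there so you can confirm the layout matches your communication pattern.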
HTH
Ralph

> On Mar 25, 2016, at 12:28 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> So I have been experimenting with different mappings, and performance
> varies a lot. The best I find is:
>
>   -map-by slot:pe=2 -np 32
>
> with 2 threads, which gives:
>
> [n001.cluster.com:29647] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
> [n001.cluster.com:29647] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.]
> [n001.cluster.com:29647] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.]
> [n001.cluster.com:29647] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.]
> [n001.cluster.com:29647] MCW rank 4 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.]
> [n001.cluster.com:29647] MCW rank 5 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
> [n001.cluster.com:29647] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.]
> [n001.cluster.com:29647] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B]
> [n003.cluster.com:29842] MCW rank 16 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
> [n002.cluster.com:32210] MCW ra
> ...
>
> ---
> Ron Cohen
> recoh...@gmail.com
> skypename: ronaldcohen
> twitter: @recohen3
>
> On Fri, Mar 25, 2016 at 3:13 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>> So
>>
>>   -map-by node:pe=2 -np 32
>>
>> runs and gives great performance, though a little worse than -n 32.
>> It puts the correct number of processes, but it does do round robin. Is
>> there a way to do this without the round robin? Also note the error
>> message:
>>
>> ====================== ALLOCATED NODES ======================
>>  n001: slots=16 max_slots=0 slots_inuse=0 state=UP
>>  n004.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>>  n003.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>>  n002.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>> =================================================================
>> --------------------------------------------------------------------------
>> WARNING: a request was made to bind a process. While the system
>> supports binding the process itself, at least one node does NOT
>> support binding memory to the process location.
>>
>>   Node: n001
>>
>> This usually is due to not having the required NUMA support installed
>> on the node. In some Linux distributions, the required support is
>> contained in the libnumactl and libnumactl-devel packages.
>> This is a warning only; your job will continue, though performance may
>> be degraded.
>> --------------------------------------------------------------------------
>> [n001.cluster.com:29316] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
>> [n001.cluster.com:29316] MCW rank 4 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.]
>> [n001.cluster.com:29316] MCW rank 8 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.]
>> [n001.cluster.com:29316] MCW rank 12 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.]
>> [n001.cluster.com:29316] MCW rank 16 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.]
>> [n001.cluster.com:29316] MCW rank 20 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
>> [n001.cluster.com:29316] MCW rank 24 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.]
>> [n001.cluster.com:29316] MCW rank 28 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B]
>> [n003.cluster.com:29704] MCW rank 22 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
>>
>> On Fri, Mar 25, 2016 at 2:32 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>> So it seems my
>>>
>>>   -map-by core:pe=2 -n 32
>>>
>>> should have worked: I would have 32 procs with 2 cores each, using 64 cores in total. But it doesn't.
>>>
>>> On Fri, Mar 25, 2016 at 2:19 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> pe=N tells us to map N cores (we call them “processing elements” because
>>>> they could be HTs if you --use-hwthreads-as-cpus) to each process. So we
>>>> will bind each process to N cores.
>>>>
>>>> So if you want 16 procs, each with two processing elements assigned to them
>>>> (which is a good choice if you are using 2 threads/process), then you would
>>>> use:
>>>>
>>>>   mpirun -map-by core:pe=2 -np 16
>>>>
>>>> If you add -report-bindings, you’ll see each process bound to two cores,
>>>> with the procs tightly packed on each node until that node’s cores are
>>>> fully utilized. We also handle the case where the core count is not an
>>>> integer multiple of pe - i.e., if you have 32 cores on a node and you ask
>>>> for pe=6, we will wind up leaving two cores idle.
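>>>> As a quick sanity check (just a sketch, assuming 16-core nodes like yours):
>>>> pe=2 gives floor(16/2) = 8 procs per node with no cores idle, while pe=6
>>>> would give floor(16/6) = 2 procs per node with 4 cores left idle. Running
>>>>
>>>>   mpirun -map-by core:pe=2 -report-bindings -np 16 ./my_app
>>>>
>>>> (./my_app standing in for your executable) prints the resulting map so you
>>>> can verify the packing before a long run.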
>>>>
>>>> HTH
>>>> Ralph
>>>>
>>>> On Mar 25, 2016, at 11:11 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>>>
>>>> or is it mpirun -map-by core:pe=8 -n 16 ?
>>>>
>>>> On Fri, Mar 25, 2016 at 2:10 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>>>
>>>> Thank you -- I looked at the man page and it is not clear to me what
>>>> pe=2 does. Is that the number of threads? So if I want 16 MPI procs
>>>> with 2 threads each, on 32 cores (two nodes), is it
>>>>
>>>>   mpirun -map-by core:pe=2 -n 16
>>>>
>>>> ?
>>>>
>>>> Sorry if I mangled this.
>>>>
>>>> Ron
>>>>
>>>> On Fri, Mar 25, 2016 at 2:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Okay, what I would suggest is that you use the following cmd line:
>>>>
>>>>   mpirun -map-by core:pe=2   (or 8, or whatever number you want)
>>>>
>>>> This should give you the best performance, as it will tight-pack the procs
>>>> and assign them to the correct number of cores. See if that helps.
>>>>
>>>> On Mar 25, 2016, at 10:38 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>>>
>>>> 1.10.2
>>>>
>>>> Ron
>>>>
>>>> On Fri, Mar 25, 2016 at 1:30 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Hmmm… what version of OMPI are you using?
>>>>
>>>> On Mar 25, 2016, at 10:27 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>>>
>>>> --report-bindings didn't report anything.
>>>>
>>>> On Fri, Mar 25, 2016 at 1:24 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>>>
>>>> --display-allocation didn't seem to give useful information:
>>>>
>>>> ====================== ALLOCATED NODES ======================
>>>>  n005: slots=16 max_slots=0 slots_inuse=0 state=UP
>>>>  n008.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>>>>  n007.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>>>>  n006.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>>>> =================================================================
>>>>
>>>> for
>>>>
>>>>   mpirun -display-allocation --map-by ppr:8:node -n 32
>>>>
>>>> Ron
>>>>
>>>> On Fri, Mar 25, 2016 at 1:17 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>>>
>>>> Actually, there was the same number of procs per node in each case. I
>>>> verified this by logging into the nodes while they were running -- in
>>>> both cases, 4 per node.
>>>>
>>>> On Fri, Mar 25, 2016 at 1:14 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> On Mar 25, 2016, at 9:59 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>>>
>>>> It is very strange, but my program runs slower with any of these
>>>> choices than if I simply use:
>>>>
>>>>   mpirun -n 16
>>>>
>>>> with, for example,
>>>>
>>>>   #PBS -l nodes=n013.cluster.com:ppn=4+n014.cluster.com:ppn=4+n015.cluster.com:ppn=4+n016.cluster.com:ppn=4
>>>>
>>>> This command will tightly pack as many procs as possible on a node - note
>>>> that we may well not see the PBS directives regarding the number of ppn. Add
>>>> --display-allocation and let's see how many slots we think were assigned
>>>> on each node.
>>>>
>>>> The timing for the latter is 165 seconds, and for
>>>>
>>>>   #PBS -l nodes=4:ppn=16,pmem=1gb
>>>>   mpirun --map-by ppr:4:node -n 16
>>>>
>>>> it is 368 seconds.
>>>>
>>>> It will typically be faster if you pack more procs/node, as they can use
>>>> shared memory for communication.
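>>>> Concretely (a sketch for this cluster's 16-slot nodes, using only options
>>>> already discussed in this thread):
>>>>
>>>>   mpirun --map-by ppr:4:node -n 16 ./my_app
>>>>
>>>> spreads 4 procs across each of 4 nodes, so most messages cross the
>>>> network, while the default packing of
>>>>
>>>>   mpirun -n 16 ./my_app
>>>>
>>>> puts all 16 procs on the first node's 16 slots, where every pairwise
>>>> message can go through shared memory. (./my_app is a placeholder for
>>>> your executable.)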
>>>>
>>>> Ron
>>>>
>>>> On Fri, Mar 25, 2016 at 12:43 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> On Mar 25, 2016, at 9:40 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>>>
>>>> Thank you! I will try it!
>>>>
>>>> What would
>>>>
>>>>   -cpus-per-proc 4 -n 16
>>>>
>>>> do?
>>>>
>>>> This would bind each process to 4 cores, filling each node with procs until
>>>> the cores on that node were exhausted, up to a total of 16 processes within
>>>> the allocation.
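>>>> In other words (my reading of it, not verified against the docs), this
>>>> should land the same way as the pe= form used above:
>>>>
>>>>   mpirun -cpus-per-proc 4 -np 16 ./my_app
>>>>   mpirun -map-by core:pe=4 -np 16 ./my_app
>>>>
>>>> i.e., four cores bound per process, with procs packed onto each node until
>>>> its cores are used up (./my_app again a placeholder).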
>>>>
>>>> On Fri, Mar 25, 2016 at 12:38 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Add -rank-by node to your cmd line. You’ll still get 4 procs/node, but they
>>>> will be ranked by node instead of consecutively within a node.
>>>>
>>>> On Mar 25, 2016, at 9:30 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>>>
>>>> I am using
>>>>
>>>>   mpirun --map-by ppr:4:node -n 16
>>>>
>>>> and this loads the processes in round-robin fashion. This seems to be
>>>> twice as slow for my code as loading them node by node, 4 processes
>>>> per node.
>>>>
>>>> How can I load them not round robin, but node by node?
>>>>
>>>> Thanks!
>>>>
>>>> Ron
>>>>
>>>> ---
>>>> Ronald Cohen
>>>> Geophysical Laboratory
>>>> Carnegie Institution
>>>> 5251 Broad Branch Rd., N.W.
>>>> Washington, D.C. 20015