Re: [OMPI users] loading processes per node

Ronald Cohen Fri, 25 Mar 2016 15:28:51 -0400 (EDT)

So I have been experimenting with different mappings, and performance
varies a lot. The best I find is:
-map-by slot:pe=2  -np 32
with 2 threads
which gives
[n001.cluster.com:29647] MCW rank 0 bound to socket 0[core 0[hwt 0]],
socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
[n001.cluster.com:29647] MCW rank 1 bound to socket 0[core 2[hwt 0]],
socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.]
[n001.cluster.com:29647] MCW rank 2 bound to socket 0[core 4[hwt 0]],
socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.]
[n001.cluster.com:29647] MCW rank 3 bound to socket 0[core 6[hwt 0]],
socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.]
[n001.cluster.com:29647] MCW rank 4 bound to socket 1[core 8[hwt 0]],
socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.]
[n001.cluster.com:29647] MCW rank 5 bound to socket 1[core 10[hwt 0]],
socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
[n001.cluster.com:29647] MCW rank 6 bound to socket 1[core 12[hwt 0]],
socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.]
[n001.cluster.com:29647] MCW rank 7 bound to socket 1[core 14[hwt 0]],
socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B]
[n003.cluster.com:29842] MCW rank 16 bound to socket 0[core 0[hwt 0]],
socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
[n002.cluster.com:32210] MCW ra
...
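
For anyone who wants to check output like this mechanically rather than by eye, here is a minimal Python sketch that tallies the bound cores per socket from one -report-bindings line. It assumes the two wrapped lines above have been joined back into one; the regex and the name parse_binding are illustrative only and are not part of Open MPI.

```python
import re

# Illustrative pattern for one re-joined -report-bindings line (an assumption,
# derived only from the output format shown in this thread).
BINDING_RE = re.compile(
    r"\[(?P<host>[^:\]]+):\d+\] MCW rank (?P<rank>\d+) bound to "
    r".*: (?P<mask>(?:\[[B./]+\])+)$"
)

def parse_binding(line):
    """Return (host, rank, number of bound cores on each socket)."""
    m = BINDING_RE.search(line.strip())
    if m is None:
        raise ValueError("not a binding line: %r" % line)
    # Each [...] group in the mask is one socket; each 'B' is one bound core.
    sockets = re.findall(r"\[([B./]+)\]", m.group("mask"))
    return m.group("host"), int(m.group("rank")), [s.count("B") for s in sockets]

line = ("[n001.cluster.com:29647] MCW rank 4 bound to socket 1[core 8[hwt 0]], "
        "socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.]")
print(parse_binding(line))
```

With pe=2 every rank should report exactly two bound cores, all on one socket, which is what the listing above shows.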


---
Ron Cohen
[email protected]
skypename: ronaldcohen
twitter: @recohen3


On Fri, Mar 25, 2016 at 3:13 PM, Ronald Cohen <[email protected]> wrote:
> So
>  -map-by node:pe=2  -np 32
> runs and gives great performance, though a little worse than -n 32.
> It puts the correct number of processes, but it does do round robin. Is
> there a way to do this without the round robin? Also note the error
> message:
>
>
> ======================   ALLOCATED NODES   ======================
>         n001: slots=16 max_slots=0 slots_inuse=0 state=UP
>         n004.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>         n003.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>         n002.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
> =================================================================
> --------------------------------------------------------------------------
> WARNING: a request was made to bind a process. While the system
> supports binding the process itself, at least one node does NOT
> support binding memory to the process location.
>
>   Node:  n001
>
> This usually is due to not having the required NUMA support installed
> on the node. In some Linux distributions, the required support is
> contained in the libnumactl and libnumactl-devel packages.
> This is a warning only; your job will continue, though performance may
> be degraded.
> --------------------------------------------------------------------------
> [n001.cluster.com:29316] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
> [n001.cluster.com:29316] MCW rank 4 bound to socket 0[core 2[hwt 0]],
> socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.]
> [n001.cluster.com:29316] MCW rank 8 bound to socket 0[core 4[hwt 0]],
> socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.]
> [n001.cluster.com:29316] MCW rank 12 bound to socket 0[core 6[hwt 0]],
> socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.]
> [n001.cluster.com:29316] MCW rank 16 bound to socket 1[core 8[hwt 0]],
> socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.]
> [n001.cluster.com:29316] MCW rank 20 bound to socket 1[core 10[hwt
> 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
> [n001.cluster.com:29316] MCW rank 24 bound to socket 1[core 12[hwt
> 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.]
> [n001.cluster.com:29316] MCW rank 28 bound to socket 1[core 14[hwt
> 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B]
> [n003.cluster.com:29704] MCW rank 22 bound to socket 1[core 10[hwt
> 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
>
>
> On Fri, Mar 25, 2016 at 2:32 PM, Ronald Cohen <[email protected]> wrote:
>> So it seems my
>> -map-by core:pe=2 -n 32
>> should have worked. I would have 32 procs with 2 cores each, giving 64 total.
>> But it doesn't.
>>
>>
>> On Fri, Mar 25, 2016 at 2:19 PM, Ralph Castain <[email protected]> wrote:
>>> pe=N tells us to map N cores (we call them “processing elements” because
>>> they could be HTs if you use --use-hwthread-cpus) to each process. So we will
>>> bind each process to N cores.
>>>
>>> So if you want 16 procs, each with two processing elements assigned to them
>>> (which is a good choice if you are using 2 threads/process), then you would
>>> use:
>>>
>>> mpirun -map-by core:pe=2 -np 16
>>>
>>> If you add -report-bindings, you’ll see each process bound to two cores,
>>> with the procs tightly packed on each node until that node’s cores are fully
>>> utilized. We do handle the unlikely event that you asked for a non-integer
>>> multiple of cores - i.e., if you have 32 cores on a node, and you ask for
>>> pe=6, we will wind up leaving two cores idle.
>>>
>>> HTH
>>> Ralph
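
The packing arithmetic described above can be sketched in a few lines of Python. This is only a model of the rule as stated in the message (whole processes of pe cores each, leftover cores idle), not Open MPI's mapper; pack_node and nodes_needed are hypothetical helper names.

```python
def pack_node(cores_per_node, pe):
    """Processes that fit on one node at pe cores each, and cores left idle."""
    procs = cores_per_node // pe
    idle = cores_per_node - procs * pe
    return procs, idle

def nodes_needed(nprocs, cores_per_node, pe):
    """Nodes consumed when processes are packed tightly, node by node."""
    procs_per_node, _ = pack_node(cores_per_node, pe)
    if procs_per_node == 0:
        raise ValueError("pe is larger than the number of cores on a node")
    return -(-nprocs // procs_per_node)  # ceiling division

print(pack_node(32, 6))          # the 32-core, pe=6 case from the message
print(pack_node(16, 2))          # the 16-core nodes in this thread
print(nodes_needed(16, 16, 2))   # -np 16 with pe=2 on 16-core nodes
```

For the 32-core, pe=6 case this gives 5 processes and 2 idle cores, matching the example in the message; for -np 16 with pe=2 on 16-core nodes it gives 2 nodes, i.e. the 32 cores asked about earlier.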
>>>
>>> On Mar 25, 2016, at 11:11 AM, Ronald Cohen <[email protected]> wrote:
>>>
>>> or is it mpirun -map-by core:pe=8 -n 16 ?
>>>
>>>
>>>
>>> On Fri, Mar 25, 2016 at 2:10 PM, Ronald Cohen <[email protected]> wrote:
>>>
>>> Thank you--I looked on the man page and it is not clear to me what
>>> pe=2 does. Is that the number of threads? So if I want 16 MPI procs
>>> with 2 threads each, on 32 cores (two nodes), is it
>>>
>>> mpirun -map-by core:pe=2 -n 16
>>>
>>> ?
>>>
>>> Sorry if I mangled this.
>>>
>>>
>>> Ron
>>>
>>>
>>>
>>> On Fri, Mar 25, 2016 at 2:03 PM, Ralph Castain <[email protected]> wrote:
>>>
>>> Okay, what I would suggest is that you use the following cmd line:
>>>
>>> mpirun -map-by core:pe=2 (or 8 or whatever number you want)
>>>
>>> This should give you the best performance as it will tight-pack the procs
>>> and assign them to the correct number of cores. See if that helps
>>>
>>> On Mar 25, 2016, at 10:38 AM, Ronald Cohen <[email protected]> wrote:
>>>
>>> 1.10.2
>>>
>>> Ron
>>>
>>>
>>>
>>> On Fri, Mar 25, 2016 at 1:30 PM, Ralph Castain <[email protected]> wrote:
>>>
>>> Hmmm…what version of OMPI are you using?
>>>
>>>
>>> On Mar 25, 2016, at 10:27 AM, Ronald Cohen <[email protected]> wrote:
>>>
>>> --report-bindings didn't report anything
>>>
>>>
>>> On Fri, Mar 25, 2016 at 1:24 PM, Ronald Cohen <[email protected]> wrote:
>>>
>>> --display-allocation
>>> didn't seem to give useful information:
>>>
>>> ======================   ALLOCATED NODES   ======================
>>>      n005: slots=16 max_slots=0 slots_inuse=0 state=UP
>>>      n008.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>>>      n007.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>>>      n006.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>>> =================================================================
>>>
>>> for
>>> mpirun -display-allocation  --map-by ppr:8:node -n 32
>>>
>>> Ron
>>>
>>>
>>>
>>> On Fri, Mar 25, 2016 at 1:17 PM, Ronald Cohen <[email protected]> wrote:
>>>
>>> Actually there was the same number of procs per node in each case. I
>>> verified this by logging into the nodes while they were running--in
>>> both cases 4 per node.
>>>
>>> Ron
>>>
>>>
>>>
>>> On Fri, Mar 25, 2016 at 1:14 PM, Ralph Castain <[email protected]> wrote:
>>>
>>>
>>> On Mar 25, 2016, at 9:59 AM, Ronald Cohen <[email protected]> wrote:
>>>
>>> It is very strange but my program runs slower with any of these
>>> choices than if I simply use:
>>>
>>> mpirun  -n 16
>>> with
>>> #PBS -l
>>> nodes=n013.cluster.com:ppn=4+n014.cluster.com:ppn=4+n015.cluster.com:ppn=4+n016.cluster.com:ppn=4
>>> for example.
>>>
>>>
>>> This command will tightly pack as many procs as possible on a node - note
>>> that we may well not see the PBS directives regarding number of ppn. Add
>>> --display-allocation and let’s see how many slots we think were assigned on
>>> each node
>>>
>>>
>>> The timing for the latter is 165 seconds, and for
>>> #PBS -l nodes=4:ppn=16,pmem=1gb
>>> mpirun  --map-by ppr:4:node -n 16
>>> it is 368 seconds.
>>>
>>>
>>> It will typically be faster if you pack more procs/node as they can use
>>> shared memory for communication.
>>>
>>>
>>> Ron
>>>
>>>
>>>
>>> On Fri, Mar 25, 2016 at 12:43 PM, Ralph Castain <[email protected]> wrote:
>>>
>>>
>>> On Mar 25, 2016, at 9:40 AM, Ronald Cohen <[email protected]> wrote:
>>>
>>> Thank you! I will try it!
>>>
>>>
>>> What would
>>> -cpus-per-proc  4 -n 16
>>> do?
>>>
>>>
>>> This would bind each process to 4 cores, filling each node with procs until
>>> the cores on that node were exhausted, to a total of 16 processes within the
>>> allocation.
>>>
>>>
>>> Ron
>>>
>>>
>>> On Fri, Mar 25, 2016 at 12:38 PM, Ralph Castain <[email protected]> wrote:
>>>
>>> Add -rank-by node to your cmd line. You’ll still get 4 procs/node, but they
>>> will be ranked by node instead of consecutively within a node.
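
To make the difference between the two rankings concrete, here is a small Python model. It assumes an even layout (the process count divides evenly across the nodes) and only illustrates the two numbering schemes described above; place and ranks_on are hypothetical names, and this is not how Open MPI computes ranks internally.

```python
def place(nprocs, nnodes, rank_by):
    """Map rank -> node index for an even layout (simplified model)."""
    if nprocs % nnodes != 0:
        raise ValueError("model assumes the same number of procs on every node")
    per_node = nprocs // nnodes
    if rank_by == "node":
        # Ranks cycle across the nodes: 0 -> node 0, 1 -> node 1, ...
        return {r: r % nnodes for r in range(nprocs)}
    if rank_by == "slot":
        # Ranks are consecutive within a node before moving to the next one.
        return {r: r // per_node for r in range(nprocs)}
    raise ValueError("unknown ranking: %r" % rank_by)

def ranks_on(placement, node):
    """Sorted list of ranks that land on the given node index."""
    return sorted(r for r, n in placement.items() if n == node)

print(ranks_on(place(16, 4, "node"), 0))
print(ranks_on(place(16, 4, "slot"), 0))
```

Either way each node holds 4 processes; only the rank numbers differ. For 16 processes on 4 nodes the first node holds ranks 0, 4, 8, 12 when ranked by node and ranks 0 to 3 when ranked consecutively. The same two patterns show up in the -report-bindings listings quoted in this thread for 32 processes (ranks 0, 4, 8, ... versus ranks 0 to 7 on n001).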
>>>
>>>
>>>
>>> On Mar 25, 2016, at 9:30 AM, Ronald Cohen <[email protected]> wrote:
>>>
>>> I am using
>>>
>>> mpirun  --map-by ppr:4:node -n 16
>>>
>>> and this loads the processes in round robin fashion. This seems to be
>>> twice as slow for my code as loading them node by node, 4 processes
>>> per node.
>>>
>>> How can I not load them round robin, but node by node?
>>>
>>> Thanks!
>>>
>>> Ron
>>>
>>>
>>>
>>> ---
>>> Ronald Cohen
>>> Geophysical Laboratory
>>> Carnegie Institution
>>> 5251 Broad Branch Rd., N.W.
>>> Washington, D.C. 20015
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2016/03/28828.php
