So -map-by node:pe=2 -np 32 runs and gives good performance, though a little worse than plain -n 32. It launches the correct number of processes, but it still places them round-robin across the nodes. Is there a way to do this without the round robin? Also note the error message:
====================== ALLOCATED NODES ======================
n001: slots=16 max_slots=0 slots_inuse=0 state=UP
n004.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
n003.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
n002.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node: n001

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.

This is a warning only; your job will continue, though performance may
be degraded.
--------------------------------------------------------------------------
[n001.cluster.com:29316] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
[n001.cluster.com:29316] MCW rank 4 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.]
[n001.cluster.com:29316] MCW rank 8 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.]
[n001.cluster.com:29316] MCW rank 12 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.]
[n001.cluster.com:29316] MCW rank 16 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.]
[n001.cluster.com:29316] MCW rank 20 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
[n001.cluster.com:29316] MCW rank 24 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.]
[n001.cluster.com:29316] MCW rank 28 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B]
[n003.cluster.com:29704] MCW rank 22 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]

---
Ron Cohen
recoh...@gmail.com
skypename: ronaldcohen
twitter: @recohen3

On Fri, Mar 25, 2016 at 2:32 PM, Ronald Cohen <recoh...@gmail.com> wrote:
> So it seems my
> -map-by core:pe=2 -n 32
> should have worked. I would have 32 procs with 2 cores each, giving 64 cores total.
> But it doesn't.
>
> On Fri, Mar 25, 2016 at 2:19 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> pe=N tells us to map N cores (we call them "processing elements" because
>> they could be HTs if you --use-hwthread-cpus) to each process. So we will
>> bind each process to N cores.
>>
>> So if you want 16 procs, each with two processing elements assigned to them
>> (which is a good choice if you are using 2 threads/process), then you would
>> use:
>>
>> mpirun -map-by core:pe=2 -np 16
>>
>> If you add -report-bindings, you'll see each process bound to two cores,
>> with the procs tightly packed on each node until that node's cores are fully
>> utilized. We do handle the case where you asked for a non-integer
>> multiple of cores - i.e., if you have 32 cores on a node and you ask for
>> pe=6, we will wind up leaving two cores idle.
>>
>> HTH
>> Ralph
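To make the distinction concrete: under the pe=N rule Ralph describes, the mapping object (core vs. node) controls placement, while pe controls how many cores each rank is bound to. A sketch of the two invocations discussed in this thread, assuming the same four 16-core nodes; the executable name ./my_app is a placeholder:

    # pack ranks node by node; each rank bound to 2 cores
    mpirun -map-by core:pe=2 -np 32 -report-bindings ./my_app

    # spread ranks round-robin across nodes; each rank bound to 2 cores
    mpirun -map-by node:pe=2 -np 32 -report-bindings ./my_app

With core mapping, ranks 0-7 should fill the first node's 16 cores before rank 8 moves to the second node; with node mapping, consecutive ranks cycle across the nodes, which matches the rank 0/4/8/... stride visible on n001 in the -report-bindings output above.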
>> On Mar 25, 2016, at 11:11 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> or is it mpirun -map-by core:pe=8 -n 16 ?
>>
>> On Fri, Mar 25, 2016 at 2:10 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> Thank you--I looked at the man page and it is not clear to me what
>> pe=2 does. Is that the number of threads? So if I want 16 MPI procs
>> with 2 threads each, on 32 cores (two nodes), is it
>>
>> mpirun -map-by core:pe=2 -n 16
>>
>> ?
>>
>> Sorry if I mangled this.
>>
>> On Fri, Mar 25, 2016 at 2:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Okay, what I would suggest is that you use the following cmd line:
>>
>> mpirun -map-by core:pe=2 (or 8 or whatever number you want)
>>
>> This should give you the best performance as it will tight-pack the procs
>> and assign them to the correct number of cores. See if that helps.
>>
>> On Mar 25, 2016, at 10:38 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> 1.10.2
>>
>> On Fri, Mar 25, 2016 at 1:30 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Hmmm... what version of OMPI are you using?
>>
>> On Mar 25, 2016, at 10:27 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> --report-bindings didn't report anything.
>>
>> On Fri, Mar 25, 2016 at 1:24 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> --display-allocation didn't seem to give useful information:
>>
>> ====================== ALLOCATED NODES ======================
>> n005: slots=16 max_slots=0 slots_inuse=0 state=UP
>> n008.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>> n007.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>> n006.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>> =================================================================
>>
>> for
>>
>> mpirun -display-allocation --map-by ppr:8:node -n 32
>>
>> On Fri, Mar 25, 2016 at 1:17 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> Actually there was the same number of procs per node in each case. I
>> verified this by logging into the nodes while they were running--in
>> both cases 4 per node.
>>
>> On Fri, Mar 25, 2016 at 1:14 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Mar 25, 2016, at 9:59 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> It is very strange, but my program runs slower with any of these
>> choices than if I simply use
>>
>> mpirun -n 16
>>
>> with
>>
>> #PBS -l nodes=n013.cluster.com:ppn=4+n014.cluster.com:ppn=4+n015.cluster.com:ppn=4+n016.cluster.com:ppn=4
>>
>> for example.
>>
>> This command will tightly pack as many procs as possible on a node - note
>> that we may well not see the PBS directives regarding number of ppn. Add
>> --display-allocation and let's see how many slots we think were assigned on
>> each node.
>>
>> The timing for the latter is 165 seconds, and for
>>
>> #PBS -l nodes=4:ppn=16,pmem=1gb
>> mpirun --map-by ppr:4:node -n 16
>>
>> it is 368 seconds.
>>
>> It will typically be faster if you pack more procs/node as they can use
>> shared memory for communication.
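As a sketch of the packed alternative to the ppr:4:node run being timed here, a minimal PBS script might look like the following, assuming 16-core nodes and an OMPI build that reads the PBS allocation automatically; the executable name is a placeholder and this is illustrative rather than a tested recipe:

    #!/bin/bash
    #PBS -l nodes=4:ppn=16,pmem=1gb
    cd $PBS_O_WORKDIR
    # tight-pack 16 ranks, 2 cores each, filling nodes in order
    # instead of spreading 4 ranks per node
    mpirun -map-by core:pe=2 -np 16 -report-bindings ./my_app

Packing neighboring ranks onto the same node lets them communicate through shared memory rather than the interconnect, per Ralph's note above.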
>> On Fri, Mar 25, 2016 at 12:43 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Mar 25, 2016, at 9:40 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> Thank you! I will try it!
>>
>> What would
>>
>> -cpus-per-proc 4 -n 16
>>
>> do?
>>
>> This would bind each process to 4 cores, filling each node with procs until
>> the cores on that node were exhausted, to a total of 16 processes within the
>> allocation.
>>
>> On Fri, Mar 25, 2016 at 12:38 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Add -rank-by node to your cmd line. You'll still get 4 procs/node, but they
>> will be ranked by node instead of consecutively within a node.
>>
>> On Mar 25, 2016, at 9:30 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> I am using
>>
>> mpirun --map-by ppr:4:node -n 16
>>
>> and this loads the processes in round-robin fashion. This seems to be
>> twice as slow for my code as loading them node by node, 4 processes
>> per node.
>>
>> How can I load them node by node instead of round robin?
>>
>> Thanks!
>>
>> Ron
>>
>> ---
>> Ronald Cohen
>> Geophysical Laboratory
>> Carnegie Institution
>> 5251 Broad Branch Rd., N.W.
>> Washington, D.C. 20015
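For reference, placement and rank numbering are controlled independently: -map-by decides which node each process lands on, while -rank-by decides how MPI_COMM_WORLD ranks are numbered over that placement (and, per Ralph's description above, -cpus-per-proc N behaves like the pe=N modifier). A sketch of the -rank-by suggestion, with a placeholder executable:

    # still 4 procs per node, but rank numbering cycles across the nodes
    # rather than running consecutively within each node
    mpirun --map-by ppr:4:node -rank-by node -n 16 ./my_app

If the goal is instead to keep consecutive ranks together on as few nodes as possible, the packed mapping discussed earlier in the thread (e.g. -map-by core:pe=2) is the option to reach for.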