> On Mar 25, 2016, at 12:53 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> Should
>   -bind-to-core
> also help?
No - if you specify pe=N, then you will automatically bind to core.

> Does the error I get matter? Should we install the libnumactl
> and libnumactl-devel packages? Thanks!

Yes! The warning you are getting is telling you that memory may not be
bound local to your process - which really can hurt performance.

> Ron
>
> ---
> Ron Cohen
> recoh...@gmail.com
> skypename: ronaldcohen
> twitter: @recohen3
>
> On Fri, Mar 25, 2016 at 3:43 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> Yeah, it can really have an impact! It is unfortunately highly
>> application-specific, so all we can do is provide the tools.
>>
>> As you can see from the binding map, we are tight-packing the procs on each
>> node to maximize the use of shared memory. However, this assumes that each
>> rank is predominantly going to “talk” to rank +/- 1 - i.e., the pattern
>> involves nearest-neighbor ranks. If that isn’t true (e.g., the lowest-ranked
>> process on one node talks to the lowest-ranked process on the next node,
>> etc.), then this would be a bad mapping for performance.
>>
>> In that case, you can use the “rank-by” option to maintain the location and
>> binding, but change the assigned MCW ranks to align with your communication
>> pattern.
>>
>> HTH
>> Ralph
>>
>> On Mar 25, 2016, at 12:28 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> So I have been experimenting with different mappings, and performance
>> varies a lot. The best I find is:
>>   -map-by slot:pe=2 -np 32
>> with 2 threads, which gives:
>> [n001.cluster.com:29647] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>> socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
>> [n001.cluster.com:29647] MCW rank 1 bound to socket 0[core 2[hwt 0]],
>> socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.]
>> [n001.cluster.com:29647] MCW rank 2 bound to socket 0[core 4[hwt 0]],
>> socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.]
>> [n001.cluster.com:29647] MCW rank 3 bound to socket 0[core 6[hwt 0]],
>> socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.]
>> [n001.cluster.com:29647] MCW rank 4 bound to socket 1[core 8[hwt 0]],
>> socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.]
>> [n001.cluster.com:29647] MCW rank 5 bound to socket 1[core 10[hwt 0]],
>> socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
>> [n001.cluster.com:29647] MCW rank 6 bound to socket 1[core 12[hwt 0]],
>> socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.]
>> [n001.cluster.com:29647] MCW rank 7 bound to socket 1[core 14[hwt 0]],
>> socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B]
>> [n003.cluster.com:29842] MCW rank 16 bound to socket 0[core 0[hwt 0]],
>> socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
>> [n002.cluster.com:32210] MCW ra
>> ...
>>
>> On Fri, Mar 25, 2016 at 3:13 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> So
>>   -map-by node:pe=2 -np 32
>> runs and gives great performance, though a little worse than plain -n 32.
>> It puts the correct number of processes on each node, but does so round
>> robin. Is there a way to do this without the round robin? Also note the
>> error message:
>>
>> ====================== ALLOCATED NODES ======================
>> n001: slots=16 max_slots=0 slots_inuse=0 state=UP
>> n004.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>> n003.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>> n002.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>> =================================================================
>> --------------------------------------------------------------------------
>> WARNING: a request was made to bind a process. While the system
>> supports binding the process itself, at least one node does NOT
>> support binding memory to the process location.
>>
>>   Node: n001
>>
>> This usually is due to not having the required NUMA support installed
>> on the node. In some Linux distributions, the required support is
>> contained in the libnumactl and libnumactl-devel packages.
>> This is a warning only; your job will continue, though performance may
>> be degraded.
>> --------------------------------------------------------------------------
>> [n001.cluster.com:29316] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>> socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
>> [n001.cluster.com:29316] MCW rank 4 bound to socket 0[core 2[hwt 0]],
>> socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.]
>> [n001.cluster.com:29316] MCW rank 8 bound to socket 0[core 4[hwt 0]],
>> socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.]
>> [n001.cluster.com:29316] MCW rank 12 bound to socket 0[core 6[hwt 0]],
>> socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.]
>> [n001.cluster.com:29316] MCW rank 16 bound to socket 1[core 8[hwt 0]],
>> socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.]
>> [n001.cluster.com:29316] MCW rank 20 bound to socket 1[core 10[hwt 0]],
>> socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
>> [n001.cluster.com:29316] MCW rank 24 bound to socket 1[core 12[hwt 0]],
>> socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.]
>> [n001.cluster.com:29316] MCW rank 28 bound to socket 1[core 14[hwt 0]],
>> socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B]
>> [n003.cluster.com:29704] MCW rank 22 bound to socket 1[core 10[hwt 0]],
>> socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
>>
>> On Fri, Mar 25, 2016 at 2:32 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> So it seems my
>>   -map-by core:pe=2 -n 32
>> should have worked: I would have 32 procs with 2 cores each, 64 cores total.
>> But it doesn't.
>>
>> On Fri, Mar 25, 2016 at 2:19 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> pe=N tells us to map N cores (we call them “processing elements” because
>> they could be HTs if you --use-hwthreads-as-cpus) to each process. So we
>> will bind each process to N cores.
>>
>> So if you want 16 procs, each with two processing elements assigned to them
>> (which is a good choice if you are using 2 threads/process), then you would
>> use:
>>
>>   mpirun -map-by core:pe=2 -np 16
>>
>> If you add -report-bindings, you’ll see each process bound to two cores,
>> with the procs tightly packed on each node until that node’s cores are
>> fully utilized. We do handle the unlikely event that you asked for a
>> non-integer multiple of cores - i.e., if you have 32 cores on a node, and
>> you ask for pe=6, we will wind up leaving two cores idle.
>>
>> HTH
>> Ralph
>>
>> On Mar 25, 2016, at 11:11 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> or is it mpirun -map-by core:pe=8 -n 16 ?
>>
>> On Fri, Mar 25, 2016 at 2:10 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> Thank you - I looked at the man page and it is not clear to me what
>> pe=2 does. Is that the number of threads? So if I want 16 MPI procs
>> with 2 threads each on 32 cores (two nodes), is it
>>
>>   mpirun -map-by core:pe=2 -n 16
>>
>> ?
>>
>> Sorry if I mangled this.
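[Editor's note: the pe=N packing rule described above reduces to simple integer arithmetic. The sketch below is an illustrative toy, not Open MPI code; the node sizes are the 16-core nodes from this thread plus Ralph's hypothetical 32-core example.]

```python
# Toy sketch (not Open MPI code) of the -map-by core:pe=N packing rule:
# each process is bound to N cores, and a node is filled until no whole
# group of N cores remains.

def pack_node(cores_per_node: int, pe: int):
    """Return (procs that fit on one node, cores left idle)."""
    return cores_per_node // pe, cores_per_node % pe

# 16-core node with pe=2: 8 two-core procs, no cores wasted.
print(pack_node(16, 2))  # (8, 0)

# Ralph's example: 32-core node with pe=6 leaves two cores idle.
print(pack_node(32, 6))  # (5, 2)
```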
>>
>> Ron
>>
>> On Fri, Mar 25, 2016 at 2:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Okay, what I would suggest is that you use the following cmd line:
>>
>>   mpirun -map-by core:pe=2 (or 8, or whatever number you want)
>>
>> This should give you the best performance, as it will tight-pack the procs
>> and assign them to the correct number of cores. See if that helps.
>>
>> On Mar 25, 2016, at 10:38 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> 1.10.2
>>
>> Ron
>>
>> On Fri, Mar 25, 2016 at 1:30 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Hmmm… what version of OMPI are you using?
>>
>> On Mar 25, 2016, at 10:27 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> --report-bindings didn't report anything.
>>
>> On Fri, Mar 25, 2016 at 1:24 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> --display-allocation didn't seem to give useful information:
>>
>> ====================== ALLOCATED NODES ======================
>> n005: slots=16 max_slots=0 slots_inuse=0 state=UP
>> n008.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>> n007.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>> n006.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>> =================================================================
>>
>> for
>>   mpirun -display-allocation --map-by ppr:8:node -n 32
>>
>> Ron
>>
>> On Fri, Mar 25, 2016 at 1:17 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> Actually there was the same number of procs per node in each case. I
>> verified this by logging into the nodes while they were running - in
>> both cases 4 per node.
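[Editor's note: the diagnostic flags exercised in this thread can be combined on one command line. This is a sketch only, using flags that appear above; `./my_app` and the counts are placeholders, and the command must be run inside a real batch allocation.]

```shell
# Show the slots Open MPI believes it was given, and where each rank was
# bound; -map-by core:pe=2 binds each of the 16 procs to two cores.
mpirun --display-allocation --report-bindings \
       -map-by core:pe=2 -np 16 ./my_app
```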
>>
>> Ron
>>
>> On Fri, Mar 25, 2016 at 1:14 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Mar 25, 2016, at 9:59 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> It is very strange, but my program runs slower with any of these
>> choices than if I simply use:
>>
>>   mpirun -n 16
>> with
>>   #PBS -l nodes=n013.cluster.com:ppn=4+n014.cluster.com:ppn=4+n015.cluster.com:ppn=4+n016.cluster.com:ppn=4
>> for example.
>>
>> This command will tightly pack as many procs as possible on a node - note
>> that we may well not see the PBS directives regarding number of ppn. Add
>> --display-allocation and let’s see how many slots we think were assigned
>> on each node.
>>
>> The timing for the latter is 165 seconds, and for
>>   #PBS -l nodes=4:ppn=16,pmem=1gb
>>   mpirun --map-by ppr:4:node -n 16
>> it is 368 seconds.
>>
>> It will typically be faster if you pack more procs/node, as they can use
>> shared memory for communication.
>>
>> Ron
>>
>> On Fri, Mar 25, 2016 at 12:43 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Mar 25, 2016, at 9:40 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> Thank you! I will try it!
>>
>> What would
>>   -cpus-per-proc 4 -n 16
>> do?
>>
>> This would bind each process to 4 cores, filling each node with procs until
>> the cores on that node were exhausted, to a total of 16 processes within
>> the allocation.
>>
>> Ron
>>
>> On Fri, Mar 25, 2016 at 12:38 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Add -rank-by node to your cmd line. You’ll still get 4 procs/node, but they
>> will be ranked by node instead of consecutively within a node.
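[Editor's note: the two rank orderings discussed in this thread can be illustrated with a toy model. This is plain Python, not Open MPI internals; the node and rank counts are made up to match the 4-procs-per-node case above.]

```python
# Toy model of the two rank-numbering schemes: ranks packed
# consecutively within each node, versus ranks dealt out round
# robin ("by node") across the nodes.

def packed(nranks: int, ppn: int) -> list:
    """Node index of each rank when ranks fill one node at a time."""
    return [r // ppn for r in range(nranks)]

def round_robin(nranks: int, nnodes: int) -> list:
    """Node index of each rank when ranks are striped across nodes."""
    return [r % nnodes for r in range(nranks)]

# 8 ranks on 2 nodes, 4 procs per node: identical placement totals,
# different numbering - which matters if rank r talks mostly to r +/- 1.
print(packed(8, 4))       # [0, 0, 0, 0, 1, 1, 1, 1]
print(round_robin(8, 2))  # [0, 1, 0, 1, 0, 1, 0, 1]
```

With packed numbering, neighboring ranks usually share a node and can communicate through shared memory; with round-robin numbering they sit on different nodes and must cross the network, which is consistent with the roughly 2x slowdown reported above.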
>>
>> On Mar 25, 2016, at 9:30 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> I am using
>>
>>   mpirun --map-by ppr:4:node -n 16
>>
>> and this loads the processes in round-robin fashion. This seems to be
>> twice as slow for my code as loading them node by node, 4 processes
>> per node.
>>
>> How can I load them not round robin, but node by node?
>>
>> Thanks!
>>
>> Ron
>>
>> ---
>> Ronald Cohen
>> Geophysical Laboratory
>> Carnegie Institution
>> 5251 Broad Branch Rd., N.W.
>> Washington, D.C. 20015
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/03/28828.php