Should -bind-to-core also help? Does the warning I get matter? Should we
install the libnumactl and libnumactl-devel packages? Thanks!
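On the package question, the usual fix for that NUMA warning is installing the distribution's numactl support on every compute node. A hedged sketch, not verified against your cluster: the libnumactl/libnumactl-devel names come from the warning text itself (SUSE-style naming), while RHEL/CentOS-family systems call the same packages numactl/numactl-devel. Run as root (or ask your admin):

```shell
# SUSE-family nodes (package names as given in the Open MPI warning):
zypper install libnumactl libnumactl-devel

# RHEL/CentOS-family nodes (same support, different package names):
yum install numactl numactl-devel

# Check whether libnuma is already present on a node:
ldconfig -p | grep libnuma
```

If libnuma shows up in the ldconfig output on every node, the warning is more likely a detection issue than genuinely missing NUMA support.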
Ron

---
Ron Cohen
recoh...@gmail.com
skypename: ronaldcohen
twitter: @recohen3

On Fri, Mar 25, 2016 at 3:43 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Yeah, it can really have an impact! It is unfortunately highly
> application-specific, so all we can do is provide the tools.
>
> As you can see from the binding map, we are tight-packing the procs on
> each node to maximize the use of shared memory. However, this assumes
> that each rank is predominantly going to “talk” to rank+/-1 - i.e., the
> pattern involves nearest-neighbor ranks. If that isn’t true (e.g., the
> lowest-ranked process on one node talks to the lowest-ranked process on
> the next node, etc.), then this would be a bad mapping for performance.
>
> In that case, you can use the “rank-by” option to maintain the location
> and binding, but change the assigned MCW ranks to align with your
> communication pattern.
>
> HTH
> Ralph
>
> On Mar 25, 2016, at 12:28 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> So I have been experimenting with different mappings, and performance
> varies a lot. The best I find is
>   -map-by slot:pe=2 -np 32
> with 2 threads, which gives:
>
> [n001.cluster.com:29647] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
> [n001.cluster.com:29647] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.]
> [n001.cluster.com:29647] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.]
> [n001.cluster.com:29647] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.]
> [n001.cluster.com:29647] MCW rank 4 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.]
> [n001.cluster.com:29647] MCW rank 5 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
> [n001.cluster.com:29647] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.]
> [n001.cluster.com:29647] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B]
> [n003.cluster.com:29842] MCW rank 16 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
> [n002.cluster.com:32210] MCW ra
> ...
>
> On Fri, Mar 25, 2016 at 3:13 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> So
>   -map-by node:pe=2 -np 32
> runs and gives great performance, though a little worse than -n 32.
> It starts the correct number of processes, but it places them
> round-robin. Is there a way to do this without the round robin? Also
> note the warning message:
>
> ======================   ALLOCATED NODES   ======================
>   n001: slots=16 max_slots=0 slots_inuse=0 state=UP
>   n004.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>   n003.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>   n002.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
> =================================================================
> --------------------------------------------------------------------------
> WARNING: a request was made to bind a process. While the system
> supports binding the process itself, at least one node does NOT
> support binding memory to the process location.
>
>   Node:  n001
>
> This usually is due to not having the required NUMA support installed
> on the node. In some Linux distributions, the required support is
> contained in the libnumactl and libnumactl-devel packages.
> This is a warning only; your job will continue, though performance may
> be degraded.
> --------------------------------------------------------------------------
> [n001.cluster.com:29316] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
> [n001.cluster.com:29316] MCW rank 4 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.]
> [n001.cluster.com:29316] MCW rank 8 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.]
> [n001.cluster.com:29316] MCW rank 12 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.]
> [n001.cluster.com:29316] MCW rank 16 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.]
> [n001.cluster.com:29316] MCW rank 20 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
> [n001.cluster.com:29316] MCW rank 24 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.]
> [n001.cluster.com:29316] MCW rank 28 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B]
> [n003.cluster.com:29704] MCW rank 22 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
>
> On Fri, Mar 25, 2016 at 2:32 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> So it seems my
>   -map-by core:pe=2 -n 32
> should have worked: I would have 32 procs with 2 cores each, giving 64
> cores total. But it doesn't.
>
> On Fri, Mar 25, 2016 at 2:19 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> pe=N tells us to map N cores (we call them “processing elements” because
> they could be HTs if you --use-hwthreads-as-cpus) to each process. So we
> will bind each process to N cores.
> So if you want 16 procs, each with two processing elements assigned to
> them (which is a good choice if you are using 2 threads/process), then
> you would use:
>
>   mpirun -map-by core:pe=2 -np 16
>
> If you add -report-bindings, you’ll see each process bound to two
> cores, with the procs tightly packed on each node until that node’s
> cores are fully utilized. We do handle the unlikely event that you
> asked for a non-integer multiple of cores - i.e., if you have 32 cores
> on a node, and you ask for pe=6, we will wind up leaving two cores
> idle.
>
> HTH
> Ralph
>
> On Mar 25, 2016, at 11:11 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> or is it mpirun -map-by core:pe=8 -n 16 ?
>
> On Fri, Mar 25, 2016 at 2:10 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> Thank you -- I looked at the man page and it is not clear to me what
> pe=2 does. Is that the number of threads? So if I want 16 MPI procs
> with 2 threads each, on 32 cores (two nodes), is it
>
>   mpirun -map-by core:pe=2 -n 16
>
> ? Sorry if I mangled this.
>
> Ron
>
> On Fri, Mar 25, 2016 at 2:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Okay, what I would suggest is that you use the following cmd line:
>
>   mpirun -map-by core:pe=2   (or 8 or whatever number you want)
>
> This should give you the best performance, as it will tight-pack the
> procs and assign them to the correct number of cores. See if that
> helps.
>
> On Mar 25, 2016, at 10:38 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> 1.10.2
>
> Ron
>
> On Fri, Mar 25, 2016 at 1:30 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Hmmm… what version of OMPI are you using?
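The pe=N placement arithmetic Ralph describes above can be sketched in a few lines (my own illustration, not Open MPI code): with C cores per node and pe=N, mpirun can fit floor(C / N) processes on a node, and any remainder cores sit idle.

```python
def procs_per_node(cores, pe):
    """Return (processes that fit on one node, cores left idle)
    when each process is bound to `pe` cores."""
    fit = cores // pe          # whole processes that fit on the node
    idle = cores - fit * pe    # leftover cores that stay unbound
    return fit, idle

# Ralph's example: 32 cores with pe=6 leaves two cores idle.
print(procs_per_node(32, 6))   # -> (5, 2)
# The poster's nodes: 16 cores with pe=2 packs perfectly.
print(procs_per_node(16, 2))   # -> (8, 0)
```

This also shows why -map-by core:pe=2 -n 32 failed on 16-core nodes earlier in the thread: 32 procs x 2 cores needs 64 cores, but the four-node allocation has exactly 64, so any node falling short of 8 procs leaves no slack.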
> On Mar 25, 2016, at 10:27 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> --report-bindings didn't report anything.
>
> On Fri, Mar 25, 2016 at 1:24 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> --display-allocation didn't seem to give useful information:
>
> ======================   ALLOCATED NODES   ======================
>   n005: slots=16 max_slots=0 slots_inuse=0 state=UP
>   n008.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>   n007.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>   n006.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
> =================================================================
>
> for
>   mpirun -display-allocation --map-by ppr:8:node -n 32
>
> Ron
>
> On Fri, Mar 25, 2016 at 1:17 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> Actually there was the same number of procs per node in each case. I
> verified this by logging into the nodes while they were running -- in
> both cases 4 per node.
>
> Ron
>
> On Fri, Mar 25, 2016 at 1:14 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Mar 25, 2016, at 9:59 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> It is very strange, but my program runs slower with any of these
> choices than if I simply use
>   mpirun -n 16
> with, for example,
>   #PBS -l nodes=n013.cluster.com:ppn=4+n014.cluster.com:ppn=4+n015.cluster.com:ppn=4+n016.cluster.com:ppn=4
>
> This command will tightly pack as many procs as possible on a node -
> note that we may well not see the PBS directives regarding number of
> ppn.
> Add --display-allocation and let’s see how many slots we think were
> assigned on each node.
>
> The timing for the latter is 165 seconds, and for
>   #PBS -l nodes=4:ppn=16,pmem=1gb
>   mpirun --map-by ppr:4:node -n 16
> it is 368 seconds.
>
> It will typically be faster if you pack more procs/node, as they can
> use shared memory for communication.
>
> Ron
>
> On Fri, Mar 25, 2016 at 12:43 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Mar 25, 2016, at 9:40 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> Thank you! I will try it!
>
> What would
>   -cpus-per-proc 4 -n 16
> do?
>
> This would bind each process to 4 cores, filling each node with procs
> until the cores on that node were exhausted, to a total of 16
> processes within the allocation.
>
> Ron
>
> On Fri, Mar 25, 2016 at 12:38 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Add -rank-by node to your cmd line. You’ll still get 4 procs/node, but
> they will be ranked by node instead of consecutively within a node.
>
> On Mar 25, 2016, at 9:30 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>
> I am using
>
>   mpirun --map-by ppr:4:node -n 16
>
> and this loads the processes in round-robin fashion. This seems to be
> twice as slow for my code as loading them node by node, 4 processes
> per node.
>
> How can I load them not round robin, but node by node?
>
> Thanks!
>
> Ron
>
> ---
> Ronald Cohen
> Geophysical Laboratory
> Carnegie Institution
> 5251 Broad Branch Rd., N.W.
> Washington, D.C.
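The suggestions scattered through the thread can be collected into a few command-line sketches (the executable name ./my_app and the process counts are placeholders; all flags are the ones quoted above):

```shell
# Tight-pack procs, 2 cores per process, and show the resulting bindings:
mpirun -map-by slot:pe=2 -report-bindings -np 32 ./my_app

# Same placement and binding, but number MCW ranks across nodes instead
# of consecutively within a node (useful when rank i mostly talks to
# ranks on other nodes, per Ralph's rank-by advice):
mpirun -map-by slot:pe=2 -rank-by node -report-bindings -np 32 ./my_app

# Fixed procs-per-node placement, with the allocation printed for
# debugging slot counts:
mpirun --map-by ppr:4:node -display-allocation -n 16 ./my_app
```

Which of these wins is application-specific, as Ralph notes: tight packing favors heavy rank+/-1 traffic via shared memory, while rank-by node favors communication between same-position ranks on different nodes.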
20015

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2016/03/28828.php