So -map-by node:pe=2 -np 32 runs and gives good performance, though a little worse than plain -n 32. It launches the correct number of processes, but it still places them round-robin across the nodes. Is there a way to do this without the round robin? Also note the error message:
====================== ALLOCATED NODES ======================
n001: slots=16 max_slots=0 slots_inuse=0 state=UP
n004.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
n003.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
n002.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node: n001

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.

This is a warning only; your job will continue, though performance may
be degraded.
--------------------------------------------------------------------------
[n001.cluster.com:29316] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.]
[n001.cluster.com:29316] MCW rank 4 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.]
[n001.cluster.com:29316] MCW rank 8 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.]
[n001.cluster.com:29316] MCW rank 12 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.]
[n001.cluster.com:29316] MCW rank 16 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.]
[n001.cluster.com:29316] MCW rank 20 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]
[n001.cluster.com:29316] MCW rank 24 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.]
[n001.cluster.com:29316] MCW rank 28 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B]
[n003.cluster.com:29704] MCW rank 22 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.]

---
Ron Cohen
recoh...@gmail.com
skypename: ronaldcohen
twitter: @recohen3

On Fri, Mar 25, 2016 at 2:32 PM, Ronald Cohen <recoh...@gmail.com> wrote:
> So it seems my
> -map-by core:pe=2 -n 32
> should have worked. I would have 32 procs with 2 cores each, giving 64 cores total.
> But it doesn't.
>
> On Fri, Mar 25, 2016 at 2:19 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> pe=N tells us to map N cores (we call them "processing elements" because
>> they could be HTs if you --use-hwthread-cpus) to each process. So we will
>> bind each process to N cores.
>>
>> So if you want 16 procs, each with two processing elements assigned to them
>> (which is a good choice if you are using 2 threads/process), then you would
>> use:
>>
>> mpirun -map-by core:pe=2 -np 16
>>
>> If you add -report-bindings, you'll see each process bound to two cores,
>> with the procs tightly packed on each node until that node's cores are fully
>> utilized. We do handle the case where you asked for a non-integer
>> multiple of cores - i.e., if you have 32 cores on a node and you ask for
>> pe=6, we will wind up leaving two cores idle.
>>
>> HTH
>> Ralph
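To make the distinction concrete: under the pe=N rule Ralph describes, the mapping object (core vs. node) controls placement, while pe controls how many cores each rank is bound to. A sketch of the two invocations discussed in this thread, assuming the same four 16-core nodes; the executable name ./my_app is a placeholder:

    # pack ranks node by node; each rank bound to 2 cores
    mpirun -map-by core:pe=2 -np 32 -report-bindings ./my_app

    # spread ranks round-robin across nodes; each rank bound to 2 cores
    mpirun -map-by node:pe=2 -np 32 -report-bindings ./my_app

With core mapping, ranks 0-7 should fill the first node's 16 cores before rank 8 moves to the second node; with node mapping, consecutive ranks cycle across the nodes, which matches the rank 0/4/8/... stride visible on n001 in the -report-bindings output above.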
>> On Mar 25, 2016, at 11:11 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> or is it mpirun -map-by core:pe=8 -n 16 ?
>>
>> On Fri, Mar 25, 2016 at 2:10 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> Thank you--I looked at the man page and it is not clear to me what
>> pe=2 does. Is that the number of threads? So if I want 16 MPI procs
>> with 2 threads each, on 32 cores (two nodes), is it
>>
>> mpirun -map-by core:pe=2 -n 16
>>
>> ?
>>
>> Sorry if I mangled this.
>>
>> On Fri, Mar 25, 2016 at 2:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Okay, what I would suggest is that you use the following cmd line:
>>
>> mpirun -map-by core:pe=2 (or 8 or whatever number you want)
>>
>> This should give you the best performance as it will tight-pack the procs
>> and assign them to the correct number of cores. See if that helps.
>>
>> On Mar 25, 2016, at 10:38 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> 1.10.2
>>
>> On Fri, Mar 25, 2016 at 1:30 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Hmmm... what version of OMPI are you using?
>>
>> On Mar 25, 2016, at 10:27 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> --report-bindings didn't report anything.
>>
>> On Fri, Mar 25, 2016 at 1:24 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> --display-allocation didn't seem to give useful information:
>>
>> ====================== ALLOCATED NODES ======================
>> n005: slots=16 max_slots=0 slots_inuse=0 state=UP
>> n008.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>> n007.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>> n006.cluster.com: slots=16 max_slots=0 slots_inuse=0 state=UP
>> =================================================================
>>
>> for
>>
>> mpirun -display-allocation --map-by ppr:8:node -n 32
>>
>> On Fri, Mar 25, 2016 at 1:17 PM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> Actually there was the same number of procs per node in each case. I
>> verified this by logging into the nodes while they were running--in
>> both cases 4 per node.
>>
>> On Fri, Mar 25, 2016 at 1:14 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Mar 25, 2016, at 9:59 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> It is very strange, but my program runs slower with any of these
>> choices than if I simply use
>>
>> mpirun -n 16
>>
>> with
>>
>> #PBS -l nodes=n013.cluster.com:ppn=4+n014.cluster.com:ppn=4+n015.cluster.com:ppn=4+n016.cluster.com:ppn=4
>>
>> for example.
>>
>> This command will tightly pack as many procs as possible on a node - note
>> that we may well not see the PBS directives regarding number of ppn. Add
>> --display-allocation and let's see how many slots we think were assigned on
>> each node.
>>
>> The timing for the latter is 165 seconds, and for
>>
>> #PBS -l nodes=4:ppn=16,pmem=1gb
>> mpirun --map-by ppr:4:node -n 16
>>
>> it is 368 seconds.
>>
>> It will typically be faster if you pack more procs/node as they can use
>> shared memory for communication.
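As a sketch of the packed alternative to the ppr:4:node run being timed here, a minimal PBS script might look like the following, assuming 16-core nodes and an OMPI build that reads the PBS allocation automatically; the executable name is a placeholder and this is illustrative rather than a tested recipe:

    #!/bin/bash
    #PBS -l nodes=4:ppn=16,pmem=1gb
    cd $PBS_O_WORKDIR
    # tight-pack 16 ranks, 2 cores each, filling nodes in order
    # instead of spreading 4 ranks per node
    mpirun -map-by core:pe=2 -np 16 -report-bindings ./my_app

Packing neighboring ranks onto the same node lets them communicate through shared memory rather than the interconnect, per Ralph's note above.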
>> On Fri, Mar 25, 2016 at 12:43 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Mar 25, 2016, at 9:40 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> Thank you! I will try it!
>>
>> What would
>>
>> -cpus-per-proc 4 -n 16
>>
>> do?
>>
>> This would bind each process to 4 cores, filling each node with procs until
>> the cores on that node were exhausted, to a total of 16 processes within the
>> allocation.
>>
>> On Fri, Mar 25, 2016 at 12:38 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Add -rank-by node to your cmd line. You'll still get 4 procs/node, but they
>> will be ranked by node instead of consecutively within a node.
>>
>> On Mar 25, 2016, at 9:30 AM, Ronald Cohen <recoh...@gmail.com> wrote:
>>
>> I am using
>>
>> mpirun --map-by ppr:4:node -n 16
>>
>> and this loads the processes in round-robin fashion. This seems to be
>> twice as slow for my code as loading them node by node, 4 processes
>> per node.
>>
>> How can I load them node by node instead of round robin?
>>
>> Thanks!
>>
>> Ron
>>
>> ---
>> Ronald Cohen
>> Geophysical Laboratory
>> Carnegie Institution
>> 5251 Broad Branch Rd., N.W.
>> Washington, D.C. 20015
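For reference, placement and rank numbering are controlled independently: -map-by decides which node each process lands on, while -rank-by decides how MPI_COMM_WORLD ranks are numbered over that placement (and, per Ralph's description above, -cpus-per-proc N behaves like the pe=N modifier). A sketch of the -rank-by suggestion, with a placeholder executable:

    # still 4 procs per node, but rank numbering cycles across the nodes
    # rather than running consecutively within each node
    mpirun --map-by ppr:4:node -rank-by node -n 16 ./my_app

If the goal is instead to keep consecutive ranks together on as few nodes as possible, the packed mapping discussed earlier in the thread (e.g. -map-by core:pe=2) is the option to reach for.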