Jeff,

there are quite a lot of changes, and I have not updated master yet (I need
extra pairs of eyes to review this...),
so unless you want to make rc2 today and rc3 a week later, it is IMHO much
safer to wait for v1.10.2.

Ralph,
any thoughts?

Cheers,

Gilles

On Wednesday, October 7, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> Is this something that needs to go into v1.10.1?
>
> If so, a PR needs to be filed ASAP.  We were supposed to make the next
> 1.10.1 RC yesterday, but it slipped to today due to some last-second patches.
>
>
> > On Oct 7, 2015, at 4:32 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> >
> > Marcin,
> >
> > here is a patch for master; hopefully it fixes all the issues we discussed.
> > I will make sure it applies cleanly against the latest 1.10 tarball starting tomorrow.
> >
> > Cheers,
> >
> > Gilles
> >
> >
> > On 10/6/2015 7:22 PM, marcin.krotkiewski wrote:
> >> Gilles,
> >>
> >> Yes, it seemed that all was fine with binding in the patched 1.10.1rc1 -
> >> thank you. I am eagerly waiting for the other patches; let me know and I
> >> will test them later this week.
> >>
> >> Marcin
> >>
> >>
> >>
> >> On 10/06/2015 12:09 PM, Gilles Gouaillardet wrote:
> >>> Marcin,
> >>>
> >>> my understanding is that in this case, the patched v1.10.1rc1 is working
> >>> just fine. Am I right?
> >>>
> >>> I prepared two patches:
> >>> one to remove the warning when binding to one core if only one core is
> >>> available,
> >>> and another to add a warning if the user asks for a binding policy that
> >>> makes no sense with the required mapping policy.
> >>>
> >>> I will hopefully finalize them tomorrow.
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >>> On Tuesday, October 6, 2015, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
> >>> Hi, Gilles
> >>>> you mentioned you had one failure with 1.10.1rc1 and -bind-to core.
> >>>> Could you please send the full details (script, allocation and output)?
> >>>> In your slurm script, before invoking mpirun, you can do
> >>>> srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep Cpus_allowed_list /proc/self/status
> >>>>
> >>> It was an interactive job allocated with
> >>>
> >>> salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0
> >>>
> >>> The slurm environment is the following
> >>>
> >>> SLURM_JOBID=12714491
> >>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
> >>> SLURM_JOB_ID=12714491
> >>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
> >>> SLURM_JOB_NUM_NODES=7
> >>> SLURM_JOB_PARTITION=normal
> >>> SLURM_MEM_PER_CPU=2048
> >>> SLURM_NNODES=7
> >>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
> >>> SLURM_NODE_ALIASES='(null)'
> >>> SLURM_NPROCS=32
> >>> SLURM_NTASKS=32
> >>> SLURM_SUBMIT_DIR=/cluster/home/marcink
> >>> SLURM_SUBMIT_HOST=login-0-1.local
> >>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
> >>>
> >>> The output of the command you asked for is
> >>>
> >>> 0: c1-2.local  Cpus_allowed_list:        1-4,17-20
> >>> 1: c1-4.local  Cpus_allowed_list:        1,15,17,31
> >>> 2: c1-8.local  Cpus_allowed_list:        0,5,9,13-14,16,21,25,29-30
> >>> 3: c1-13.local  Cpus_allowed_list:       3-7,19-23
> >>> 4: c1-16.local  Cpus_allowed_list:       12-15,28-31
> >>> 5: c1-23.local  Cpus_allowed_list:       2-4,8,13-15,18-20,24,29-31
> >>> 6: c1-26.local  Cpus_allowed_list:       1,6,11,13,15,17,22,27,29,31
> >>>
> >>> Running with command
> >>>
> >>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core
> --report-bindings --map-by socket -np 32 ./affinity
> >>>
> >>> I have attached two output files: one for the original 1.10.1rc1, one
> for the patched version.
> >>>
> >>> When I said 'failed in one case' I was not precise. I got an error on
> >>> node c1-8, which was the first node to have a different number of MPI
> >>> processes on the two sockets. It would also have failed on some later
> >>> nodes; we just never got there because of the error.
> >>>
> >>> Let me know if you need more.
> >>>
> >>> Marcin
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>> Cheers,
> >>>>
> >>>> Gilles
> >>>>
> >>>> On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:
> >>>>> Hi, all,
> >>>>>
> >>>>> I played a bit more and it seems that the problem results from
> >>>>>
> >>>>> trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()
> >>>>>
> >>>>> called in rmaps_base_binding.c / bind_downwards being wrong. I do not
> >>>>> know the reason, but I think I know when the problem happens (at least
> >>>>> on 1.10.1rc1). It seems that by default Open MPI maps by socket. The
> >>>>> error happens when, for a given compute node, a different number of
> >>>>> cores is used on each socket. Consider the previously studied case (the
> >>>>> debug outputs I sent in my last post). c1-8, which was the source of
> >>>>> the error, has 5 MPI processes assigned, and the cpuset is the following:
> >>>>>
> >>>>> 0, 5, 9, 13, 14, 16, 21, 25, 29, 30
> >>>>>
> >>>>> Cores 0 and 5 are on socket 0; cores 9, 13 and 14 are on socket 1.
> >>>>> Binding progresses correctly up to and including core 13 (see the end
> >>>>> of file out.1.10.1rc2, before the error) - that is, 2 cores on socket 0
> >>>>> and 2 cores on socket 1. The error is thrown when core 14 should be
> >>>>> bound: an extra core on socket 1 with no corresponding core on socket 0.
> >>>>> At that point the returned trg_obj points to the first core on the node
> >>>>> (os_index 0, socket 0).
> >>>>>
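[For illustration only: the following sketch is not from the original thread and is not Open MPI code. It assumes the hwloc 1.x API that was current at the time (HWLOC_OBJ_SOCKET) and a build along the lines of "gcc persocket.c -o persocket -lhwloc"; all names are made up. It prints how many allowed hardware threads fall on each socket of the current cpuset - an uneven count per socket, as on c1-8 above, is the condition being described.]

    /* persocket.c - hypothetical example, not Open MPI code */
    #include <stdio.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_bitmap_t allowed = hwloc_bitmap_alloc();
        hwloc_bitmap_t tmp     = hwloc_bitmap_alloc();
        int s, nsock;

        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        /* the set of PUs this process may run on (the SLURM/cgroup cpuset) */
        hwloc_get_cpubind(topo, allowed, HWLOC_CPUBIND_PROCESS);

        nsock = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET);
        for (s = 0; s < nsock; s++) {
            hwloc_obj_t sock = hwloc_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, s);
            hwloc_bitmap_and(tmp, allowed, sock->cpuset);
            /* with HT enabled, 2 allowed PUs correspond to 1 full core */
            printf("socket %d: %d allowed PUs\n", s, hwloc_bitmap_weight(tmp));
        }

        hwloc_bitmap_free(tmp);
        hwloc_bitmap_free(allowed);
        hwloc_topology_destroy(topo);
        return 0;
    }

[On c1-8 with the cpuset quoted above, this would be expected to report 4 allowed PUs on socket 0 and 6 on socket 1.]
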
> >>>>> I have submitted a few other jobs and I always got an error in such a
> >>>>> situation. Moreover, if I now use --map-by core instead of --map-by
> >>>>> socket, the error is gone, and I get my expected binding:
> >>>>>
> >>>>> rank 0 @ compute-1-2.local  1, 17,
> >>>>> rank 1 @ compute-1-2.local  2, 18,
> >>>>> rank 2 @ compute-1-2.local  3, 19,
> >>>>> rank 3 @ compute-1-2.local  4, 20,
> >>>>> rank 4 @ compute-1-4.local  1, 17,
> >>>>> rank 5 @ compute-1-4.local  15, 31,
> >>>>> rank 6 @ compute-1-8.local  0, 16,
> >>>>> rank 7 @ compute-1-8.local  5, 21,
> >>>>> rank 8 @ compute-1-8.local  9, 25,
> >>>>> rank 9 @ compute-1-8.local  13, 29,
> >>>>> rank 10 @ compute-1-8.local  14, 30,
> >>>>> rank 11 @ compute-1-13.local  3, 19,
> >>>>> rank 12 @ compute-1-13.local  4, 20,
> >>>>> rank 13 @ compute-1-13.local  5, 21,
> >>>>> rank 14 @ compute-1-13.local  6, 22,
> >>>>> rank 15 @ compute-1-13.local  7, 23,
> >>>>> rank 16 @ compute-1-16.local  12, 28,
> >>>>> rank 17 @ compute-1-16.local  13, 29,
> >>>>> rank 18 @ compute-1-16.local  14, 30,
> >>>>> rank 19 @ compute-1-16.local  15, 31,
> >>>>> rank 20 @ compute-1-23.local  2, 18,
> >>>>> rank 29 @ compute-1-26.local  11, 27,
> >>>>> rank 21 @ compute-1-23.local  3, 19,
> >>>>> rank 30 @ compute-1-26.local  13, 29,
> >>>>> rank 22 @ compute-1-23.local  4, 20,
> >>>>> rank 31 @ compute-1-26.local  15, 31,
> >>>>> rank 23 @ compute-1-23.local  8, 24,
> >>>>> rank 27 @ compute-1-26.local  1, 17,
> >>>>> rank 24 @ compute-1-23.local  13, 29,
> >>>>> rank 28 @ compute-1-26.local  6, 22,
> >>>>> rank 25 @ compute-1-23.local  14, 30,
> >>>>> rank 26 @ compute-1-23.local  15, 31,
> >>>>>
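[For reference: the ./affinity test program used throughout this thread was not posted. A minimal stand-in that prints output in the same "rank N @ host cpu-list" shape could look like the following (hypothetical code, built with something like "mpicc affinity.c -o affinity"; the real program may differ).]

    /* affinity.c - hypothetical stand-in for the reporter used in this thread */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, cpu;
        char host[256];
        char list[8192] = "";
        cpu_set_t mask;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        gethostname(host, sizeof(host));

        /* the logical CPUs this rank may run on (its binding, if any) */
        sched_getaffinity(0, sizeof(mask), &mask);
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &mask))
                snprintf(list + strlen(list), sizeof(list) - strlen(list),
                         "%d, ", cpu);

        printf("rank %d @ %s  %s\n", rank, host, list);

        MPI_Finalize();
        return 0;
    }

[Run under mpirun as in the commands above, it would print one such line per rank, as in the listings quoted in this thread.]
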
> >>>>> Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and
> >>>>> 1.10.1rc1. However, there is still a difference in behavior between
> >>>>> 1.10.1rc1 and the earlier versions. In the SLURM job described in my
> >>>>> last post, 1.10.1rc1 fails to bind in only 1 case, while the earlier
> >>>>> versions fail in 21 out of 32 cases. You mentioned there was a bug in
> >>>>> hwloc; I am not sure whether it can explain the difference in behavior.
> >>>>>
> >>>>> Hope this helps to nail this down.
> >>>>>
> >>>>> Marcin
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:
> >>>>>> Ralph,
> >>>>>>
> >>>>>> I suspect ompi tries to bind to threads outside the cpuset.
> >>>>>> This could be pretty similar to a previous issue, when ompi tried to
> >>>>>> bind to cores outside the cpuset.
> >>>>>> /* when a core has more than one thread, would ompi assume all the
> >>>>>> threads are available if the core is available? */
> >>>>>> I will investigate this starting tomorrow.
> >>>>>>
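[To make that question concrete, here is a hypothetical check - again hwloc 1.x, not Open MPI code, all names made up - that flags any core whose hardware threads are only partially present in the current cpuset, i.e. exactly the case where assuming "core available => all of its threads available" would go wrong.]

    /* partial_cores.c - hypothetical example; build: gcc partial_cores.c -lhwloc */
    #include <stdio.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_bitmap_t allowed = hwloc_bitmap_alloc();
        hwloc_bitmap_t tmp     = hwloc_bitmap_alloc();
        int c, ncores;

        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        /* the set of PUs this process may run on (the SLURM/cgroup cpuset) */
        hwloc_get_cpubind(topo, allowed, HWLOC_CPUBIND_PROCESS);

        ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        for (c = 0; c < ncores; c++) {
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, c);
            int have, total;
            hwloc_bitmap_and(tmp, allowed, core->cpuset);
            have  = hwloc_bitmap_weight(tmp);
            total = hwloc_bitmap_weight(core->cpuset);
            /* a partially allowed core: some, but not all, of its HTs usable */
            if (have > 0 && have < total)
                printf("core L#%u: only %d of %d hwthreads allowed\n",
                       core->logical_index, have, total);
        }

        hwloc_bitmap_free(tmp);
        hwloc_bitmap_free(allowed);
        hwloc_topology_destroy(topo);
        return 0;
    }

[On the allocations shown in this thread it should print nothing, since SLURM hands out whole cores; any output would indicate a cpuset where that assumption breaks.]
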
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Gilles
> >>>>>>
> >>>>>> On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>> Thanks - please go ahead and release that allocation as I’m not
> going to get to this immediately. I’ve got several hot irons in the fire
> right now, and I’m not sure when I’ll get a chance to track this down.
> >>>>>>
> >>>>>> Gilles or anyone else who might have time - feel free to take a
> gander and see if something pops out at you.
> >>>>>>
> >>>>>> Ralph
> >>>>>>
> >>>>>>
> >>>>>>> On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> Done. I have compiled 1.10.0 and 1.10.1rc1 with --enable-debug and
> >>>>>>> executed
> >>>>>>>
> >>>>>>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes
> --report-bindings --bind-to core -np 32 ./affinity
> >>>>>>>
> >>>>>>> In the case of 1.10.1rc1 I also added :overload-allowed - the output
> >>>>>>> is in a separate file. This option did not make much difference for
> >>>>>>> 1.10.0, so I did not attach that output here.
> >>>>>>>
> >>>>>>> First thing I noted for 1.10.0 are lines like
> >>>>>>>
> >>>>>>> [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
> >>>>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
> >>>>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26
> IS NOT BOUND
> >>>>>>>
> >>>>>>> with an empty BITMAP.
> >>>>>>>
> >>>>>>> The SLURM environment is
> >>>>>>>
> >>>>>>> set | grep SLURM
> >>>>>>> SLURM_JOBID=12714491
> >>>>>>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
> >>>>>>> SLURM_JOB_ID=12714491
> >>>>>>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
> >>>>>>> SLURM_JOB_NUM_NODES=7
> >>>>>>> SLURM_JOB_PARTITION=normal
> >>>>>>> SLURM_MEM_PER_CPU=2048
> >>>>>>> SLURM_NNODES=7
> >>>>>>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
> >>>>>>> SLURM_NODE_ALIASES='(null)'
> >>>>>>> SLURM_NPROCS=32
> >>>>>>> SLURM_NTASKS=32
> >>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink
> >>>>>>> SLURM_SUBMIT_HOST=login-0-1.local
> >>>>>>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
> >>>>>>>
> >>>>>>> I have submitted an interactive job on screen for 120 hours now to
> work with one example, and not change it for every post :)
> >>>>>>>
> >>>>>>> If you need anything else, let me know. I could introduce some
> patch/printfs and recompile, if you need it.
> >>>>>>>
> >>>>>>> Marcin
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 10/03/2015 07:17 PM, Ralph Castain wrote:
> >>>>>>>> Rats - just realized I have no way to test this, as none of the
> >>>>>>>> machines I can access are set up for cgroup-based multi-tenancy. Is
> >>>>>>>> this a debug version of OMPI? If not, can you rebuild OMPI with
> >>>>>>>> --enable-debug?
> >>>>>>>>
> >>>>>>>> Then please run it with --mca rmaps_base_verbose 10 and pass along
> >>>>>>>> the output.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Ralph
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Oct 3, 2015, at 10:09 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>>>>>
> >>>>>>>>> What version of slurm is this? I might try to debug it here. I’m
> not sure where the problem lies just yet.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Here is the output of lstopo. In short, (0,16) is core 0,
> >>>>>>>>>> (1,17) is core 1, etc.
> >>>>>>>>>>
> >>>>>>>>>> Machine (64GB)
> >>>>>>>>>>   NUMANode L#0 (P#0 32GB)
> >>>>>>>>>>     Socket L#0 + L3 L#0 (20MB)
> >>>>>>>>>>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core
> L#0
> >>>>>>>>>>         PU L#0 (P#0)
> >>>>>>>>>>         PU L#1 (P#16)
> >>>>>>>>>>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core
> L#1
> >>>>>>>>>>         PU L#2 (P#1)
> >>>>>>>>>>         PU L#3 (P#17)
> >>>>>>>>>>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core
> L#2
> >>>>>>>>>>         PU L#4 (P#2)
> >>>>>>>>>>         PU L#5 (P#18)
> >>>>>>>>>>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core
> L#3
> >>>>>>>>>>         PU L#6 (P#3)
> >>>>>>>>>>         PU L#7 (P#19)
> >>>>>>>>>>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core
> L#4
> >>>>>>>>>>         PU L#8 (P#4)
> >>>>>>>>>>         PU L#9 (P#20)
> >>>>>>>>>>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core
> L#5
> >>>>>>>>>>         PU L#10 (P#5)
> >>>>>>>>>>         PU L#11 (P#21)
> >>>>>>>>>>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core
> L#6
> >>>>>>>>>>         PU L#12 (P#6)
> >>>>>>>>>>         PU L#13 (P#22)
> >>>>>>>>>>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core
> L#7
> >>>>>>>>>>         PU L#14 (P#7)
> >>>>>>>>>>         PU L#15 (P#23)
> >>>>>>>>>>     HostBridge L#0
> >>>>>>>>>>       PCIBridge
> >>>>>>>>>>         PCI 8086:1521
> >>>>>>>>>>           Net L#0 "eth0"
> >>>>>>>>>>         PCI 8086:1521
> >>>>>>>>>>           Net L#1 "eth1"
> >>>>>>>>>>       PCIBridge
> >>>>>>>>>>         PCI 15b3:1003
> >>>>>>>>>>           Net L#2 "ib0"
> >>>>>>>>>>           OpenFabrics L#3 "mlx4_0"
> >>>>>>>>>>       PCIBridge
> >>>>>>>>>>         PCI 102b:0532
> >>>>>>>>>>       PCI 8086:1d02
> >>>>>>>>>>         Block L#4 "sda"
> >>>>>>>>>>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
> >>>>>>>>>>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
> >>>>>>>>>>       PU L#16 (P#8)
> >>>>>>>>>>       PU L#17 (P#24)
> >>>>>>>>>>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
> >>>>>>>>>>       PU L#18 (P#9)
> >>>>>>>>>>       PU L#19 (P#25)
> >>>>>>>>>>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core
> L#10
> >>>>>>>>>>       PU L#20 (P#10)
> >>>>>>>>>>       PU L#21 (P#26)
> >>>>>>>>>>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core
> L#11
> >>>>>>>>>>       PU L#22 (P#11)
> >>>>>>>>>>       PU L#23 (P#27)
> >>>>>>>>>>     L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core
> L#12
> >>>>>>>>>>       PU L#24 (P#12)
> >>>>>>>>>>       PU L#25 (P#28)
> >>>>>>>>>>     L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core
> L#13
> >>>>>>>>>>       PU L#26 (P#13)
> >>>>>>>>>>       PU L#27 (P#29)
> >>>>>>>>>>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core
> L#14
> >>>>>>>>>>       PU L#28 (P#14)
> >>>>>>>>>>       PU L#29 (P#30)
> >>>>>>>>>>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core
> L#15
> >>>>>>>>>>       PU L#30 (P#15)
> >>>>>>>>>>       PU L#31 (P#31)
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 10/03/2015 05:46 PM, Ralph Castain wrote:
> >>>>>>>>>>> Maybe I’m just misreading your HT map - that slurm nodelist
> syntax is a new one to me, but they tend to change things around. Could you
> run lstopo on one of those compute nodes and send the output?
> >>>>>>>>>>>
> >>>>>>>>>>> I’m just suspicious because I’m not seeing a clear pairing of
> HT numbers in your output, but HT numbering is BIOS-specific and I may just
> not be understanding your particular pattern. Our error message is clearly
> indicating that we are seeing individual HTs (and not complete cores)
> assigned, and I don’t know the source of that confusion.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 10/03/2015 04:38 PM, Ralph Castain wrote:
> >>>>>>>>>>>>> If mpirun isn’t trying to do any binding, then you will of
> course get the right mapping as we’ll just inherit whatever we received.
> >>>>>>>>>>>> Yes. I meant that whatever you received (what SLURM gives) is a
> >>>>>>>>>>>> correct cpu map and that it assigns _whole_ CPUs (cores), not
> >>>>>>>>>>>> single HTs, to MPI processes. In the case mentioned earlier,
> >>>>>>>>>>>> Open MPI should start 6 tasks on c1-30. If HTs were treated as
> >>>>>>>>>>>> separate and independent cores, sched_getaffinity of an MPI
> >>>>>>>>>>>> process started on c1-30 would return a map with only 6 entries.
> >>>>>>>>>>>> In my case it returns a map with 12 entries - 2 for each core.
> >>>>>>>>>>>> So each process is in fact allocated both HTs, not only one. Is
> >>>>>>>>>>>> what I'm saying correct?
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Looking at your output, it’s pretty clear that you are
> getting independent HTs assigned and not full cores.
> >>>>>>>>>>>> How do you mean? Is the above understanding wrong? I would
> >>>>>>>>>>>> expect that on c1-30 with --bind-to core Open MPI should bind
> >>>>>>>>>>>> rank 0 to logical cores 0 and 16, rank 1 to cores 1 and 17, and
> >>>>>>>>>>>> so on. All those logical cores are available in the
> >>>>>>>>>>>> sched_getaffinity map, and there are twice as many logical cores
> >>>>>>>>>>>> as there are MPI processes started on the node.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> My guess is that something in slurm has changed such that it
> detects that HT has been enabled, and then begins treating the HTs as
> completely independent cpus.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Try changing “-bind-to core” to “-bind-to hwthread
> -use-hwthread-cpus” and see if that works
> >>>>>>>>>>>>>
> >>>>>>>>>>>> I have, and the binding is wrong. For example, I got this
> >>>>>>>>>>>> output:
> >>>>>>>>>>>>
> >>>>>>>>>>>> rank 0 @ compute-1-30.local  0,
> >>>>>>>>>>>> rank 1 @ compute-1-30.local  16,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Which means that two ranks have been bound to the same
> physical core (logical cores 0 and 16 are two HTs of the same core). If I
> use --bind-to core, I get the following correct binding
> >>>>>>>>>>>>
> >>>>>>>>>>>> rank 0 @ compute-1-30.local  0, 16,
> >>>>>>>>>>>>
> >>>>>>>>>>>> The problem is that many other ranks get a bad binding, with a
> >>>>>>>>>>>> 'rank XXX is not bound (or bound to all available processors)'
> >>>>>>>>>>>> warning.
> >>>>>>>>>>>>
> >>>>>>>>>>>> But I think I was not entirely correct in saying that 1.10.1rc1
> >>>>>>>>>>>> did not fix things. It still might have improved something, but
> >>>>>>>>>>>> not everything. Consider this job:
> >>>>>>>>>>>>
> >>>>>>>>>>>> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
> >>>>>>>>>>>> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
> >>>>>>>>>>>>
> >>>>>>>>>>>> If I run 32 tasks as follows (with 1.10.1rc1)
> >>>>>>>>>>>>
> >>>>>>>>>>>> mpirun --hetero-nodes --report-bindings --bind-to core -np 32
> ./affinity
> >>>>>>>>>>>>
> >>>>>>>>>>>> I get the following error:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> --------------------------------------------------------------------------
> >>>>>>>>>>>> A request was made to bind to that would result in binding
> more
> >>>>>>>>>>>> processes than cpus on a resource:
> >>>>>>>>>>>>
> >>>>>>>>>>>>    Bind to:     CORE
> >>>>>>>>>>>>    Node:        c9-31
> >>>>>>>>>>>>    #processes:  2
> >>>>>>>>>>>>    #cpus:       1
> >>>>>>>>>>>>
> >>>>>>>>>>>> You can override this protection by adding the
> "overload-allowed"
> >>>>>>>>>>>> option to your binding directive.
> >>>>>>>>>>>>
> --------------------------------------------------------------------------
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> If I now use --bind-to core:overload-allowed, then Open MPI
> >>>>>>>>>>>> starts and _most_ of the ranks are bound correctly (i.e., the
> >>>>>>>>>>>> map contains two logical cores in all cases), except for this
> >>>>>>>>>>>> case, which required the overload flag:
> >>>>>>>>>>>>
> >>>>>>>>>>>> rank 15 @ compute-9-31.local   1, 17,
> >>>>>>>>>>>> rank 16 @ compute-9-31.local  11, 27,
> >>>>>>>>>>>> rank 17 @ compute-9-31.local   2, 18,
> >>>>>>>>>>>> rank 18 @ compute-9-31.local  12, 28,
> >>>>>>>>>>>> rank 19 @ compute-9-31.local   1, 17,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Note that the pair (1,17) is used twice. The original
> >>>>>>>>>>>> SLURM-delivered map (no binding) on this node is:
> >>>>>>>>>>>>
> >>>>>>>>>>>> rank 15 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27,
> 28, 29,
> >>>>>>>>>>>> rank 16 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27,
> 28, 29,
> >>>>>>>>>>>> rank 17 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27,
> 28, 29,
> >>>>>>>>>>>> rank 18 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27,
> 28, 29,
> >>>>>>>>>>>> rank 19 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27,
> 28, 29,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Why does Open MPI use core (1,17) twice instead of using core
> >>>>>>>>>>>> (13,29)? Clearly, the original SLURM-delivered map includes 5
> >>>>>>>>>>>> physical cores, enough for 5 MPI processes.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Marcin
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 10/03/2015 01:06 PM, Ralph Castain wrote:
> >>>>>>>>>>>>>>> Thanks Marcin. Looking at this, I’m guessing that Slurm
> may be treating HTs as “cores” - i.e., as independent cpus. Any chance that
> is true?
> >>>>>>>>>>>>>> Not to the best of my knowledge, and at least not
> intentionally. SLURM starts as many processes as there are physical cores,
> not threads. To verify this, consider this test case:
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
> <heterogeneous_topologies.patch>
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
