Is this something that needs to go into v1.10.1?

If so, a PR needs to be filed ASAP.  We were supposed to make the next 1.10.1 
RC yesterday, but it slipped to today due to some last-second patches.


> On Oct 7, 2015, at 4:32 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Marcin,
> 
> Here is a patch for master; hopefully it fixes all the issues we discussed.
> I will make sure it applies cleanly against the latest 1.10 tarball, starting tomorrow.
> 
> Cheers,
> 
> Gilles
> 
> 
> On 10/6/2015 7:22 PM, marcin.krotkiewski wrote:
>> Gilles,
>> 
>> Yes, it seemed that all was fine with binding in the patched 1.10.1rc1 - 
>> thank you. Eagerly waiting for the other patches, let me know and I will 
>> test them later this week.
>> 
>> Marcin
>> 
>> 
>> 
>> On 10/06/2015 12:09 PM, Gilles Gouaillardet wrote:
>>> Marcin,
>>> 
>>> My understanding is that in this case, the patched v1.10.1rc1 is working just 
>>> fine. Am I right?
>>> 
>>> I prepared two patches: one to remove the warning when binding to one core 
>>> if only one core is available, and another to add a warning if the user 
>>> requests a binding policy that makes no sense with the requested mapping 
>>> policy.
>>> 
>>> Hopefully I will finalize them tomorrow.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On Tuesday, October 6, 2015, marcin.krotkiewski 
>>> <marcin.krotkiew...@gmail.com> wrote:
>>> Hi, Gilles
>>>> You mentioned you had one failure with 1.10.1rc1 and -bind-to core.
>>>> Could you please send the full details (script, allocation and output)?
>>>> In your slurm script, before invoking mpirun, you can do
>>>> srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep 
>>>> Cpus_allowed_list /proc/self/status
>>>> 
>>> It was an interactive job allocated with
>>> 
>>> salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0
>>> 
>>> The SLURM environment is the following:
>>> 
>>> SLURM_JOBID=12714491
>>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
>>> SLURM_JOB_ID=12714491
>>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
>>> SLURM_JOB_NUM_NODES=7
>>> SLURM_JOB_PARTITION=normal
>>> SLURM_MEM_PER_CPU=2048
>>> SLURM_NNODES=7
>>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
>>> SLURM_NODE_ALIASES='(null)'
>>> SLURM_NPROCS=32
>>> SLURM_NTASKS=32
>>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>>> SLURM_SUBMIT_HOST=login-0-1.local
>>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
>>> 
>>> The output of the command you asked for is:
>>> 
>>> 0: c1-2.local  Cpus_allowed_list:        1-4,17-20
>>> 1: c1-4.local  Cpus_allowed_list:        1,15,17,31
>>> 2: c1-8.local  Cpus_allowed_list:        0,5,9,13-14,16,21,25,29-30
>>> 3: c1-13.local  Cpus_allowed_list:       3-7,19-23
>>> 4: c1-16.local  Cpus_allowed_list:       12-15,28-31
>>> 5: c1-23.local  Cpus_allowed_list:       2-4,8,13-15,18-20,24,29-31
>>> 6: c1-26.local  Cpus_allowed_list:       1,6,11,13,15,17,22,27,29,31
>>> 
>>> Running with the command
>>> 
>>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core 
>>> --report-bindings --map-by socket -np 32 ./affinity
>>> 
>>> I have attached two output files: one for the original 1.10.1rc1, one for 
>>> the patched version.
>>> 
>>> When I said 'failed in one case' I was not precise. I got an error on node 
>>> c1-8, which was the first one to have a different number of MPI processes on 
>>> the two sockets. It would also fail on some later nodes; it is just that 
>>> because of the error we never got there.
>>> 
>>> Let me know if you need more.
>>> 
>>> Marcin
>>> 
>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> 
>>>> On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:
>>>>> Hi, all,
>>>>> 
>>>>> I played a bit more and it seems that the problem results from
>>>>> 
>>>>> trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()
>>>>> 
>>>>> called in rmaps_base_binding.c / bind_downwards being wrong. I do not 
>>>>> know the reason, but I think I know when the problem happens (at least on 
>>>>> 1.10.1rc1). It seems that by default openmpi maps by socket. The error 
>>>>> happens when, for a given compute node, a different number of cores is 
>>>>> used on each socket. Consider the previously studied case (the debug 
>>>>> outputs I sent in the last post): c1-8, which was the source of the error, 
>>>>> has 5 MPI processes assigned, and its cpuset is the following:
>>>>> 
>>>>> 0, 5, 9, 13, 14, 16, 21, 25, 29, 30
>>>>> 
>>>>> Cores 0 and 5 are on socket 0; cores 9, 13, and 14 are on socket 1. Binding 
>>>>> progresses correctly up to and including core 13 (see the end of file 
>>>>> out.1.10.1rc2, before the error). That is 2 cores on socket 0 and 2 
>>>>> cores on socket 1. The error is thrown when core 14 should be bound: the 
>>>>> extra core on socket 1 with no corresponding core on socket 0. At that 
>>>>> point the returned trg_obj points to the first core on the node (os_index 
>>>>> 0, socket 0).
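>>>>> 
>>>>> For illustration only (this is not the Open MPI source): the selection that 
>>>>> opal_hwloc_base_find_min_bound_target_under_obj() is expected to perform is 
>>>>> roughly "pick the least-used core under the target socket, restricted to 
>>>>> the allowed cpuset". A minimal hwloc sketch of that idea, where the usage[] 
>>>>> counters are a hypothetical stand-in for Open MPI's internal bookkeeping 
>>>>> (hwloc 1.x API assumed):
>>>>> 
>>>>> #include <hwloc.h>
>>>>> #include <stdio.h>
>>>>> 
>>>>> /* Hypothetical per-core usage counters (Open MPI keeps equivalent state). */
>>>>> static unsigned usage[1024];
>>>>> 
>>>>> /* Return the least-used core under 'parent' whose PUs all lie inside
>>>>>  * 'allowed', or NULL if no such core exists. */
>>>>> static hwloc_obj_t least_bound_core(hwloc_topology_t topo,
>>>>>                                     hwloc_const_cpuset_t allowed,
>>>>>                                     hwloc_obj_t parent)
>>>>> {
>>>>>     hwloc_obj_t core = NULL, best = NULL;
>>>>>     while ((core = hwloc_get_next_obj_inside_cpuset_by_type(
>>>>>                        topo, allowed, HWLOC_OBJ_CORE, core)) != NULL) {
>>>>>         if (!hwloc_bitmap_isincluded(core->cpuset, parent->cpuset))
>>>>>             continue;   /* not under the target socket */
>>>>>         if (best == NULL ||
>>>>>             usage[core->logical_index] < usage[best->logical_index])
>>>>>             best = core;
>>>>>     }
>>>>>     return best;
>>>>> }
>>>>> 
>>>>> int main(void)
>>>>> {
>>>>>     hwloc_topology_t topo;
>>>>>     hwloc_const_cpuset_t allowed;
>>>>>     hwloc_obj_t socket, core;
>>>>> 
>>>>>     hwloc_topology_init(&topo);
>>>>>     hwloc_topology_load(topo);  /* topology already restricted to our cpuset */
>>>>>     allowed = hwloc_topology_get_allowed_cpuset(topo);
>>>>>     socket = hwloc_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, 1);
>>>>>     core = socket ? least_bound_core(topo, allowed, socket) : NULL;
>>>>>     printf("selected core os_index: %d\n", core ? (int)core->os_index : -1);
>>>>>     hwloc_topology_destroy(topo);
>>>>>     return 0;
>>>>> }
>>>>> 
>>>>> In the failing case above, one would expect such a selection to keep 
>>>>> returning cores on socket 1 once socket 0 is exhausted, rather than falling 
>>>>> back to the core at os_index 0.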
>>>>> 
>>>>> I have submitted a few other jobs and I always got an error in such a 
>>>>> situation. Moreover, if I now use --map-by core instead of --map-by socket, 
>>>>> the error is gone and I get the expected binding:
>>>>> 
>>>>> rank 0 @ compute-1-2.local  1, 17,
>>>>> rank 1 @ compute-1-2.local  2, 18,
>>>>> rank 2 @ compute-1-2.local  3, 19,
>>>>> rank 3 @ compute-1-2.local  4, 20,
>>>>> rank 4 @ compute-1-4.local  1, 17,
>>>>> rank 5 @ compute-1-4.local  15, 31,
>>>>> rank 6 @ compute-1-8.local  0, 16,
>>>>> rank 7 @ compute-1-8.local  5, 21,
>>>>> rank 8 @ compute-1-8.local  9, 25,
>>>>> rank 9 @ compute-1-8.local  13, 29,
>>>>> rank 10 @ compute-1-8.local  14, 30,
>>>>> rank 11 @ compute-1-13.local  3, 19,
>>>>> rank 12 @ compute-1-13.local  4, 20,
>>>>> rank 13 @ compute-1-13.local  5, 21,
>>>>> rank 14 @ compute-1-13.local  6, 22,
>>>>> rank 15 @ compute-1-13.local  7, 23,
>>>>> rank 16 @ compute-1-16.local  12, 28,
>>>>> rank 17 @ compute-1-16.local  13, 29,
>>>>> rank 18 @ compute-1-16.local  14, 30,
>>>>> rank 19 @ compute-1-16.local  15, 31,
>>>>> rank 20 @ compute-1-23.local  2, 18,
>>>>> rank 29 @ compute-1-26.local  11, 27,
>>>>> rank 21 @ compute-1-23.local  3, 19,
>>>>> rank 30 @ compute-1-26.local  13, 29,
>>>>> rank 22 @ compute-1-23.local  4, 20,
>>>>> rank 31 @ compute-1-26.local  15, 31,
>>>>> rank 23 @ compute-1-23.local  8, 24,
>>>>> rank 27 @ compute-1-26.local  1, 17,
>>>>> rank 24 @ compute-1-23.local  13, 29,
>>>>> rank 28 @ compute-1-26.local  6, 22,
>>>>> rank 25 @ compute-1-23.local  14, 30,
>>>>> rank 26 @ compute-1-23.local  15, 31,
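>>>>> 
>>>>> (The ./affinity test used here just prints each rank's hostname and the 
>>>>> CPUs in its sched_getaffinity mask. Its source was not posted, so the 
>>>>> following is only a rough equivalent sketch:)
>>>>> 
>>>>> /* affinity.c - print each rank's host and allowed CPUs.
>>>>>  * Build with: mpicc -o affinity affinity.c */
>>>>> #define _GNU_SOURCE
>>>>> #include <sched.h>
>>>>> #include <stdio.h>
>>>>> #include <unistd.h>
>>>>> #include <mpi.h>
>>>>> 
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     int rank, cpu;
>>>>>     char host[256];
>>>>>     cpu_set_t mask;
>>>>> 
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>     gethostname(host, sizeof(host));
>>>>> 
>>>>>     CPU_ZERO(&mask);
>>>>>     sched_getaffinity(0, sizeof(mask), &mask);  /* 0 = calling process */
>>>>> 
>>>>>     printf("rank %d @ %s  ", rank, host);
>>>>>     for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
>>>>>         if (CPU_ISSET(cpu, &mask))
>>>>>             printf("%d, ", cpu);
>>>>>     printf("\n");
>>>>> 
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }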
>>>>> 
>>>>> Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and 
>>>>> 1.10.1rc1. However, there is still a difference in behavior between 
>>>>> 1.10.1rc1 and the earlier versions. In the SLURM job described in the last 
>>>>> post, 1.10.1rc1 fails to bind in only 1 case, while the earlier versions 
>>>>> fail in 21 out of 32 cases. You mentioned there was a bug in hwloc; I am 
>>>>> not sure whether it can explain the difference in behavior.
>>>>> 
>>>>> Hope this helps to nail this down.
>>>>> 
>>>>> Marcin
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:
>>>>>> Ralph,
>>>>>> 
>>>>>> I suspect ompi tries to bind to threads outside the cpuset.
>>>>>> This could be pretty similar to a previous issue where ompi tried to bind 
>>>>>> to cores outside the cpuset.
>>>>>> (When a core has more than one thread, does ompi assume all the threads 
>>>>>> are available if the core is available?)
>>>>>> I will investigate this starting tomorrow.
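>>>>>> 
>>>>>> (One quick way to check that hypothesis outside of Open MPI: a small hwloc 
>>>>>> program that reports, for every core, whether all of its hardware threads 
>>>>>> are inside the allowed cpuset or only some of them. This is just an 
>>>>>> illustrative sketch, hwloc 1.x API assumed:)
>>>>>> 
>>>>>> #include <hwloc.h>
>>>>>> #include <stdio.h>
>>>>>> 
>>>>>> int main(void)
>>>>>> {
>>>>>>     hwloc_topology_t topo;
>>>>>>     hwloc_const_cpuset_t allowed;
>>>>>>     int i, n;
>>>>>> 
>>>>>>     hwloc_topology_init(&topo);
>>>>>>     /* Keep disallowed PUs visible so we can compare against the cgroup cpuset. */
>>>>>>     hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM);
>>>>>>     hwloc_topology_load(topo);
>>>>>> 
>>>>>>     allowed = hwloc_topology_get_allowed_cpuset(topo);
>>>>>>     n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
>>>>>>     for (i = 0; i < n; i++) {
>>>>>>         hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
>>>>>>         char buf[128];
>>>>>>         hwloc_bitmap_list_snprintf(buf, sizeof(buf), core->cpuset);
>>>>>>         if (hwloc_bitmap_isincluded(core->cpuset, allowed))
>>>>>>             printf("core %u (PUs %s): all threads allowed\n", core->os_index, buf);
>>>>>>         else if (hwloc_bitmap_intersects(core->cpuset, allowed))
>>>>>>             printf("core %u (PUs %s): only SOME threads allowed\n", core->os_index, buf);
>>>>>>         /* cores with no allowed PU are simply outside our cpuset */
>>>>>>     }
>>>>>>     hwloc_topology_destroy(topo);
>>>>>>     return 0;
>>>>>> }
>>>>>> 
>>>>>> If the second message ever appears inside one of these SLURM allocations, 
>>>>>> that would confirm that only part of a core's threads were made available.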
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Gilles
>>>>>> 
>>>>>> On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Thanks - please go ahead and release that allocation as I’m not going to 
>>>>>> get to this immediately. I’ve got several hot irons in the fire right 
>>>>>> now, and I’m not sure when I’ll get a chance to track this down.
>>>>>> 
>>>>>> Gilles or anyone else who might have time - feel free to take a gander 
>>>>>> and see if something pops out at you.
>>>>>> 
>>>>>> Ralph
>>>>>> 
>>>>>> 
>>>>>>> On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski 
>>>>>>> <marcin.krotkiew...@gmail.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Done. I have compiled 1.10.0 and 1.10.1rc1 with --enable-debug and 
>>>>>>> executed
>>>>>>> 
>>>>>>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings 
>>>>>>> --bind-to core -np 32 ./affinity
>>>>>>> 
>>>>>>> In the case of 1.10.1rc1 I have also added :overload-allowed; that output 
>>>>>>> is in a separate file. This option did not make much difference for 
>>>>>>> 1.10.0, so I did not attach it here.
>>>>>>> 
>>>>>>> The first thing I noted for 1.10.0 is lines like
>>>>>>> 
>>>>>>> [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
>>>>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
>>>>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 IS 
>>>>>>> NOT BOUND
>>>>>>> 
>>>>>>> with an empty BITMAP.
>>>>>>> 
>>>>>>> The SLURM environment is
>>>>>>> 
>>>>>>> set | grep SLURM
>>>>>>> SLURM_JOBID=12714491
>>>>>>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
>>>>>>> SLURM_JOB_ID=12714491
>>>>>>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
>>>>>>> SLURM_JOB_NUM_NODES=7
>>>>>>> SLURM_JOB_PARTITION=normal
>>>>>>> SLURM_MEM_PER_CPU=2048
>>>>>>> SLURM_NNODES=7
>>>>>>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
>>>>>>> SLURM_NODE_ALIASES='(null)'
>>>>>>> SLURM_NPROCS=32
>>>>>>> SLURM_NTASKS=32
>>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>>>>>>> SLURM_SUBMIT_HOST=login-0-1.local
>>>>>>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
>>>>>>> 
>>>>>>> I have now submitted an interactive job (in screen) for 120 hours, to 
>>>>>>> work with one example and not change it for every post :)
>>>>>>> 
>>>>>>> If you need anything else, let me know. I could introduce some 
>>>>>>> patch/printfs and recompile, if you need it.
>>>>>>> 
>>>>>>> Marcin
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 10/03/2015 07:17 PM, Ralph Castain wrote:
>>>>>>>> Rats - just realized I have no way to test this, as none of the 
>>>>>>>> machines I can access are set up for cgroup-based multi-tenancy. Is this 
>>>>>>>> a debug version of OMPI? If not, can you rebuild OMPI with 
>>>>>>>> --enable-debug?
>>>>>>>> 
>>>>>>>> Then please run it with --mca rmaps_base_verbose 10 and pass along the 
>>>>>>>> output.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Ralph
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Oct 3, 2015, at 10:09 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>> 
>>>>>>>>> What version of slurm is this? I might try to debug it here. I’m not 
>>>>>>>>> sure where the problem lies just yet.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski 
>>>>>>>>>> <marcin.krotkiew...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Here is the output of lstopo. In short, (0,16) is core 0, (1,17) is 
>>>>>>>>>> core 1, etc.
>>>>>>>>>> 
>>>>>>>>>> Machine (64GB)
>>>>>>>>>>   NUMANode L#0 (P#0 32GB)
>>>>>>>>>>     Socket L#0 + L3 L#0 (20MB)
>>>>>>>>>>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>>>>>>>>>>         PU L#0 (P#0)
>>>>>>>>>>         PU L#1 (P#16)
>>>>>>>>>>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>>>>>>>>>>         PU L#2 (P#1)
>>>>>>>>>>         PU L#3 (P#17)
>>>>>>>>>>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>>>>>>>>>>         PU L#4 (P#2)
>>>>>>>>>>         PU L#5 (P#18)
>>>>>>>>>>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>>>>>>>>>>         PU L#6 (P#3)
>>>>>>>>>>         PU L#7 (P#19)
>>>>>>>>>>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>>>>>>>>>>         PU L#8 (P#4)
>>>>>>>>>>         PU L#9 (P#20)
>>>>>>>>>>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>>>>>>>>>>         PU L#10 (P#5)
>>>>>>>>>>         PU L#11 (P#21)
>>>>>>>>>>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>>>>>>>>>>         PU L#12 (P#6)
>>>>>>>>>>         PU L#13 (P#22)
>>>>>>>>>>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>>>>>>>>>>         PU L#14 (P#7)
>>>>>>>>>>         PU L#15 (P#23)
>>>>>>>>>>     HostBridge L#0
>>>>>>>>>>       PCIBridge
>>>>>>>>>>         PCI 8086:1521
>>>>>>>>>>           Net L#0 "eth0"
>>>>>>>>>>         PCI 8086:1521
>>>>>>>>>>           Net L#1 "eth1"
>>>>>>>>>>       PCIBridge
>>>>>>>>>>         PCI 15b3:1003
>>>>>>>>>>           Net L#2 "ib0"
>>>>>>>>>>           OpenFabrics L#3 "mlx4_0"
>>>>>>>>>>       PCIBridge
>>>>>>>>>>         PCI 102b:0532
>>>>>>>>>>       PCI 8086:1d02
>>>>>>>>>>         Block L#4 "sda"
>>>>>>>>>>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>>>>>>>>>>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>>>>>>>>>>       PU L#16 (P#8)
>>>>>>>>>>       PU L#17 (P#24)
>>>>>>>>>>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>>>>>>>>>>       PU L#18 (P#9)
>>>>>>>>>>       PU L#19 (P#25)
>>>>>>>>>>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>>>>>>>>>>       PU L#20 (P#10)
>>>>>>>>>>       PU L#21 (P#26)
>>>>>>>>>>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>>>>>>>>>>       PU L#22 (P#11)
>>>>>>>>>>       PU L#23 (P#27)
>>>>>>>>>>     L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>>>>>>>>>>       PU L#24 (P#12)
>>>>>>>>>>       PU L#25 (P#28)
>>>>>>>>>>     L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>>>>>>>>>>       PU L#26 (P#13)
>>>>>>>>>>       PU L#27 (P#29)
>>>>>>>>>>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>>>>>>>>>>       PU L#28 (P#14)
>>>>>>>>>>       PU L#29 (P#30)
>>>>>>>>>>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>>>>>>>>>>       PU L#30 (P#15)
>>>>>>>>>>       PU L#31 (P#31)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 10/03/2015 05:46 PM, Ralph Castain wrote:
>>>>>>>>>>> Maybe I’m just misreading your HT map - that slurm nodelist syntax 
>>>>>>>>>>> is a new one to me, but they tend to change things around. Could 
>>>>>>>>>>> you run lstopo on one of those compute nodes and send the output?
>>>>>>>>>>> 
>>>>>>>>>>> I’m just suspicious because I’m not seeing a clear pairing of HT 
>>>>>>>>>>> numbers in your output, but HT numbering is BIOS-specific and I may 
>>>>>>>>>>> just not be understanding your particular pattern. Our error 
>>>>>>>>>>> message is clearly indicating that we are seeing individual HTs 
>>>>>>>>>>> (and not complete cores) assigned, and I don’t know the source of 
>>>>>>>>>>> that confusion.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski 
>>>>>>>>>>>> <marcin.krotkiew...@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 10/03/2015 04:38 PM, Ralph Castain wrote:
>>>>>>>>>>>>> If mpirun isn’t trying to do any binding, then you will of course 
>>>>>>>>>>>>> get the right mapping as we’ll just inherit whatever we received.
>>>>>>>>>>>> Yes. I meant that whatever you received (what SLURM gives) is a 
>>>>>>>>>>>> correct cpu map and that it assigns _whole_ CPUs, not single HTs, to 
>>>>>>>>>>>> MPI processes. In the case mentioned earlier, openmpi should start 6 
>>>>>>>>>>>> tasks on c1-30. If HTs were treated as separate and independent 
>>>>>>>>>>>> cores, sched_getaffinity of an MPI process started on c1-30 would 
>>>>>>>>>>>> return a map with only 6 entries. In my case it returns a map with 
>>>>>>>>>>>> 12 entries, 2 for each core. So one process is in fact allocated 
>>>>>>>>>>>> both HTs, not only one. Is what I'm saying correct?
>>>>>>>>>>>> 
>>>>>>>>>>>>> Looking at your output, it’s pretty clear that you are getting 
>>>>>>>>>>>>> independent HTs assigned and not full cores. 
>>>>>>>>>>>> How do you mean? Is the above understanding wrong? I would expect 
>>>>>>>>>>>> that on c1-30 with --bind-to core openmpi should bind to logical 
>>>>>>>>>>>> cores 0 and 16 (rank 0), 1 and 17 (rank 2), and so on. All those 
>>>>>>>>>>>> logical cores are available in the sched_getaffinity map, and there 
>>>>>>>>>>>> are twice as many logical cores as there are MPI processes started 
>>>>>>>>>>>> on the node.
>>>>>>>>>>>> 
>>>>>>>>>>>>> My guess is that something in slurm has changed such that it 
>>>>>>>>>>>>> detects that HT has been enabled, and then begins treating the 
>>>>>>>>>>>>> HTs as completely independent cpus.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Try changing “-bind-to core” to “-bind-to hwthread  
>>>>>>>>>>>>> -use-hwthread-cpus” and see if that works
>>>>>>>>>>>>> 
>>>>>>>>>>>> I have, and the binding is wrong. For example, I got this output:
>>>>>>>>>>>> 
>>>>>>>>>>>> rank 0 @ compute-1-30.local  0,
>>>>>>>>>>>> rank 1 @ compute-1-30.local  16,
>>>>>>>>>>>> 
>>>>>>>>>>>> Which means that two ranks have been bound to the same physical 
>>>>>>>>>>>> core (logical cores 0 and 16 are two HTs of the same core). If I 
>>>>>>>>>>>> use --bind-to core, I get the following correct binding
>>>>>>>>>>>> 
>>>>>>>>>>>> rank 0 @ compute-1-30.local  0, 16,
>>>>>>>>>>>> 
>>>>>>>>>>>> The problem is that many other ranks get a bad binding, with a 'rank 
>>>>>>>>>>>> XXX is not bound (or bound to all available processors)' warning.
>>>>>>>>>>>> 
>>>>>>>>>>>> But I think I was not entirely correct saying that 1.10.1rc1 did 
>>>>>>>>>>>> not fix things. It still might have improved something, but not 
>>>>>>>>>>>> everything. Consider this job:
>>>>>>>>>>>> 
>>>>>>>>>>>> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
>>>>>>>>>>>> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
>>>>>>>>>>>> 
>>>>>>>>>>>> If I run 32 tasks as follows (with 1.10.1rc1)
>>>>>>>>>>>> 
>>>>>>>>>>>> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 
>>>>>>>>>>>> ./affinity
>>>>>>>>>>>> 
>>>>>>>>>>>> I get the following error:
>>>>>>>>>>>> 
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>>>>>> processes than cpus on a resource:
>>>>>>>>>>>> 
>>>>>>>>>>>>    Bind to:     CORE
>>>>>>>>>>>>    Node:        c9-31
>>>>>>>>>>>>    #processes:  2
>>>>>>>>>>>>    #cpus:       1
>>>>>>>>>>>> 
>>>>>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>>>>>> option to your binding directive.
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> If I now use --bind-to core:overload-allowed, then openmpi starts 
>>>>>>>>>>>> and _most_ of the ranks are bound correctly (i.e., the map contains 
>>>>>>>>>>>> two logical cores in ALL cases), except for this case, which 
>>>>>>>>>>>> required the overload flag:
>>>>>>>>>>>> 
>>>>>>>>>>>> rank 15 @ compute-9-31.local   1, 17,
>>>>>>>>>>>> rank 16 @ compute-9-31.local  11, 27,
>>>>>>>>>>>> rank 17 @ compute-9-31.local   2, 18, 
>>>>>>>>>>>> rank 18 @ compute-9-31.local  12, 28,
>>>>>>>>>>>> rank 19 @ compute-9-31.local   1, 17,
>>>>>>>>>>>> 
>>>>>>>>>>>> Note that the pair (1,17) is used twice. The original SLURM-delivered 
>>>>>>>>>>>> map (no binding) on this node is:
>>>>>>>>>>>> 
>>>>>>>>>>>> rank 15 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>>>>>>>> rank 16 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>>>>>>>> rank 17 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>>>>>>>> rank 18 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>>>>>>>> rank 19 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>>>>>>>> 
>>>>>>>>>>>> Why does openmpi use core (1,17) twice instead of using core 
>>>>>>>>>>>> (13,29)? Clearly, the original SLURM-delivered map includes 5 cores, 
>>>>>>>>>>>> enough for 5 MPI processes.
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> 
>>>>>>>>>>>> Marcin
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski 
>>>>>>>>>>>>>> <marcin.krotkiew...@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 10/03/2015 01:06 PM, Ralph Castain wrote:
>>>>>>>>>>>>>>> Thanks Marcin. Looking at this, I’m guessing that Slurm may be 
>>>>>>>>>>>>>>> treating HTs as “cores” - i.e., as independent cpus. Any chance 
>>>>>>>>>>>>>>> that is true?
>>>>>>>>>>>>>> Not to the best of my knowledge, and at least not intentionally. 
>>>>>>>>>>>>>> SLURM starts as many processes as there are physical cores, not 
>>>>>>>>>>>>>> threads. To verify this, consider this test case:
> 
> <heterogeneous_topologies.patch>


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
