Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

Ralph Castain Sat, 3 Oct 2015 15:40:04 -0400 (EDT)

Thanks - please go ahead and release that allocation as I’m not going to get to 
this immediately. I’ve got several hot irons in the fire right now, and I’m not 
sure when I’ll get a chance to track this down.


Gilles or anyone else who might have time - feel free to take a gander and see 
if something pops out at you.

Ralph


> On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski 
> <marcin.krotkiew...@gmail.com> wrote:
> 
> 
> Done. I have compiled 1.10.0 and 1.10.rc1 with --enable-debug and executed
> 
> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings --bind-to 
> core -np 32 ./affinity
> 
> In case of 1.10.rc1 I have also added :overload-allowed - output in a 
> separate file. This option did not make much difference for 1.10.0, so I did 
> not attach it here.
> 
> First thing I noted for 1.10.0 are lines like
> 
> [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 IS NOT 
> BOUND
> 
> with an empty BITMAP.
> 
> The SLURM environment is
> 
> set | grep SLURM
> SLURM_JOBID=12714491
> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
> SLURM_JOB_ID=12714491
> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
> SLURM_JOB_NUM_NODES=7
> SLURM_JOB_PARTITION=normal
> SLURM_MEM_PER_CPU=2048
> SLURM_NNODES=7
> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
> SLURM_NODE_ALIASES='(null)'
> SLURM_NPROCS=32
> SLURM_NTASKS=32
> SLURM_SUBMIT_DIR=/cluster/home/marcink
> SLURM_SUBMIT_HOST=login-0-1.local
> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
> 
> I have submitted an interactive job on screen for 120 hours now to work with 
> one example, and not change it for every post :)
> 
> If you need anything else, let me know. I could introduce some patch/printfs 
> and recompile, if you need it.
> 
> Marcin
> 
> 
> 
> On 10/03/2015 07:17 PM, Ralph Castain wrote:
>> Rats - just realized I have no way to test this as none of the machines I 
>> can access are setup for cgroup-based multi-tenant. Is this a debug version 
>> of OMPI? If not, can you rebuild OMPI with —enable-debug?
>> 
>> Then please run it with —mca rmaps_base_verbose 10 and pass along the output.
>> 
>> Thanks
>> Ralph
>> 
>> 
>>> On Oct 3, 2015, at 10:09 AM, Ralph Castain <r...@open-mpi.org 
>>> <mailto:r...@open-mpi.org>> wrote:
>>> 
>>> What version of slurm is this? I might try to debug it here. I’m not sure 
>>> where the problem lies just yet.
>>> 
>>> 
>>>> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski 
>>>> <marcin.krotkiew...@gmail.com <mailto:marcin.krotkiew...@gmail.com>> wrote:
>>>> 
>>>> Here is the output of lstopo. In short, (0,16) are core 0, (1,17) - core 1 
>>>> etc.
>>>> 
>>>> Machine (64GB)
>>>>   NUMANode L#0 (P#0 32GB)
>>>>     Socket L#0 + L3 L#0 (20MB)
>>>>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>>>>         PU L#0 (P#0)
>>>>         PU L#1 (P#16)
>>>>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>>>>         PU L#2 (P#1)
>>>>         PU L#3 (P#17)
>>>>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>>>>         PU L#4 (P#2)
>>>>         PU L#5 (P#18)
>>>>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>>>>         PU L#6 (P#3)
>>>>         PU L#7 (P#19)
>>>>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>>>>         PU L#8 (P#4)
>>>>         PU L#9 (P#20)
>>>>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>>>>         PU L#10 (P#5)
>>>>         PU L#11 (P#21)
>>>>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>>>>         PU L#12 (P#6)
>>>>         PU L#13 (P#22)
>>>>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>>>>         PU L#14 (P#7)
>>>>         PU L#15 (P#23)
>>>>     HostBridge L#0
>>>>       PCIBridge
>>>>         PCI 8086:1521
>>>>           Net L#0 "eth0"
>>>>         PCI 8086:1521
>>>>           Net L#1 "eth1"
>>>>       PCIBridge
>>>>         PCI 15b3:1003
>>>>           Net L#2 "ib0"
>>>>           OpenFabrics L#3 "mlx4_0"
>>>>       PCIBridge
>>>>         PCI 102b:0532
>>>>       PCI 8086:1d02
>>>>         Block L#4 "sda"
>>>>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>>>>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>>>>       PU L#16 (P#8)
>>>>       PU L#17 (P#24)
>>>>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>>>>       PU L#18 (P#9)
>>>>       PU L#19 (P#25)
>>>>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>>>>       PU L#20 (P#10)
>>>>       PU L#21 (P#26)
>>>>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>>>>       PU L#22 (P#11)
>>>>       PU L#23 (P#27)
>>>>     L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>>>>       PU L#24 (P#12)
>>>>       PU L#25 (P#28)
>>>>     L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>>>>       PU L#26 (P#13)
>>>>       PU L#27 (P#29)
>>>>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>>>>       PU L#28 (P#14)
>>>>       PU L#29 (P#30)
>>>>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>>>>       PU L#30 (P#15)
>>>>       PU L#31 (P#31)
>>>> 
>>>> 
>>>> 
>>>> On 10/03/2015 05:46 PM, Ralph Castain wrote:
>>>>> Maybe I’m just misreading your HT map - that slurm nodelist syntax is a 
>>>>> new one to me, but they tend to change things around. Could you run 
>>>>> lstopo on one of those compute nodes and send the output?
>>>>> 
>>>>> I’m just suspicious because I’m not seeing a clear pairing of HT numbers 
>>>>> in your output, but HT numbering is BIOS-specific and I may just not be 
>>>>> understanding your particular pattern. Our error message is clearly 
>>>>> indicating that we are seeing individual HTs (and not complete cores) 
>>>>> assigned, and I don’t know the source of that confusion.
>>>>> 
>>>>> 
>>>>>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski < 
>>>>>> <mailto:marcin.krotkiew...@gmail.com>marcin.krotkiew...@gmail.com 
>>>>>> <mailto:marcin.krotkiew...@gmail.com>> wrote:
>>>>>> 
>>>>>> 
>>>>>> On 10/03/2015 04:38 PM, Ralph Castain wrote:
>>>>>>> If mpirun isn’t trying to do any binding, then you will of course get 
>>>>>>> the right mapping as we’ll just inherit whatever we received.
>>>>>> Yes. I meant that whatever you received (what SLURM gives) is a correct 
>>>>>> cpu map and assigns _whole_ CPUs, not a single HT to MPI processes. In 
>>>>>> the case mentioned earlier openmpi should start 6 tasks on c1-30. If HT 
>>>>>> would be treated as separate and independent cores, sched_getaffinity of 
>>>>>> an MPI process started on c1-30 would return a map with 6 entries only. 
>>>>>> In my case it returns a map with 12 entries - 2 for each core. So one  
>>>>>> process is in fact allocated both HTs, not only one. Is what I'm saying 
>>>>>> correct?
>>>>>> 
>>>>>>> Looking at your output, it’s pretty clear that you are getting 
>>>>>>> independent HTs assigned and not full cores. 
>>>>>> How do you mean? Is the above understanding wrong? I would expect that 
>>>>>> on c1-30 with --bind-to core openmpi should bind to logical cores 0 and 
>>>>>> 16 (rank 0), 1 and 17 (rank 2) and so on. All those logical cores are 
>>>>>> available in sched_getaffinity map, and there is twice as many logical 
>>>>>> cores as there are MPI processes started on the node.
>>>>>> 
>>>>>>> My guess is that something in slurm has changed such that it detects 
>>>>>>> that HT has been enabled, and then begins treating the HTs as 
>>>>>>> completely independent cpus.
>>>>>>> 
>>>>>>> Try changing “-bind-to core” to “-bind-to hwthread  -use-hwthread-cpus” 
>>>>>>> and see if that works
>>>>>>> 
>>>>>> I have and the binding is wrong. For example, I got this output
>>>>>> 
>>>>>> rank 0 @ compute-1-30.local  0,
>>>>>> rank 1 @ compute-1-30.local  16,
>>>>>> 
>>>>>> Which means that two ranks have been bound to the same physical core 
>>>>>> (logical cores 0 and 16 are two HTs of the same core). If I use 
>>>>>> --bind-to core, I get the following correct binding
>>>>>> 
>>>>>> rank 0 @ compute-1-30.local  0, 16,
>>>>>> 
>>>>>> The problem is many other ranks get bad binding with 'rank XXX is not 
>>>>>> bound (or bound to all available processors)' warning.
>>>>>> 
>>>>>> But I think I was not entirely correct saying that 1.10.1rc1 did not fix 
>>>>>> things. It still might have improved something, but not everything. 
>>>>>> Consider this job:
>>>>>> 
>>>>>> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
>>>>>> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
>>>>>> 
>>>>>> If I run 32 tasks as follows (with 1.10.1rc1)
>>>>>> 
>>>>>> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
>>>>>> 
>>>>>> I get the following error:
>>>>>> 
>>>>>> --------------------------------------------------------------------------
>>>>>> A request was made to bind to that would result in binding more
>>>>>> processes than cpus on a resource:
>>>>>> 
>>>>>>    Bind to:     CORE
>>>>>>    Node:        c9-31
>>>>>>    #processes:  2
>>>>>>    #cpus:       1
>>>>>> 
>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>> option to your binding directive.
>>>>>> --------------------------------------------------------------------------
>>>>>> 
>>>>>> 
>>>>>> If I now use --bind-to core:overload-allowed, then openmpi starts and 
>>>>>> _most_ of the threads are bound correctly (i.e., map contains two 
>>>>>> logical cores in ALL cases), except this case that required the overload 
>>>>>> flag:
>>>>>> 
>>>>>> rank 15 @ compute-9-31.local   1, 17,
>>>>>> rank 16 @ compute-9-31.local  11, 27,
>>>>>> rank 17 @ compute-9-31.local   2, 18, 
>>>>>> rank 18 @ compute-9-31.local  12, 28,
>>>>>> rank 19 @ compute-9-31.local   1, 17,
>>>>>> 
>>>>>> Note pair 1,17 is used twice. The original SLURM delivered map (no 
>>>>>> binding) on this node is
>>>>>> 
>>>>>> rank 15 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29, 
>>>>>> rank 16 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>> rank 17 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>> rank 18 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>> rank 19 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>> 
>>>>>> Why does openmpi use cores (1,17) twice instead of using core (13,29)? 
>>>>>> Clearly, the original SLURM-delivered map has 5 CPUs included, enough 
>>>>>> for 5 MPI processes. 
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Marcin
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski < 
>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>marcin.krotkiew...@gmail.com 
>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 10/03/2015 01:06 PM, Ralph Castain wrote:
>>>>>>>>> Thanks Marcin. Looking at this, I’m guessing that Slurm may be 
>>>>>>>>> treating HTs as “cores” - i.e., as independent cpus. Any chance that 
>>>>>>>>> is true?
>>>>>>>> Not to the best of my knowledge, and at least not intentionally. SLURM 
>>>>>>>> starts as many processes as there are physical cores, not threads. To 
>>>>>>>> verify this, consider this test case:
>>>>>>>> 
>>>>>>>> SLURM_JOB_CPUS_PER_NODE='6,8(x2),10'
>>>>>>>> SLURM_JOB_NODELIST='c1-[30-31],c2-[32,34]'
>>>>>>>> 
>>>>>>>> If I now execute only one mpi process WITH NO BINDING, it will go onto 
>>>>>>>> c1-30 and should have a map with 6 CPUs (12 hw threads). I run
>>>>>>>> 
>>>>>>>> mpirun --bind-to none -np 1 ./affinity
>>>>>>>> rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>>>> 
>>>>>>>> I have attached the affinity.c program FYI. Clearly, sched_getaffinity 
>>>>>>>> in my test code returns the correct map.
>>>>>>>> 
>>>>>>>> Now if I try to start all 32 processes in this example (still no 
>>>>>>>> binding):
>>>>>>>> 
>>>>>>>> rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>>>> rank 1 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>>>> rank 10 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>>>> 27, 28, 29, 30, 31,
>>>>>>>> rank 11 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>>>> 27, 28, 29, 30, 31,
>>>>>>>> rank 12 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>>>> 27, 28, 29, 30, 31,
>>>>>>>> rank 13 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>>>> 27, 28, 29, 30, 31,
>>>>>>>> rank 6 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>>>> 27, 28, 29, 30, 31,
>>>>>>>> rank 2 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>>>> rank 7 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>>>> 27, 28, 29, 30, 31,
>>>>>>>> rank 8 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>>>> 27, 28, 29, 30, 31,
>>>>>>>> rank 3 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>>>> rank 14 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>>>> 26, 27, 28, 29, 30,
>>>>>>>> rank 4 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>>>> rank 15 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>>>> 26, 27, 28, 29, 30,
>>>>>>>> rank 9 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>>>> 27, 28, 29, 30, 31,
>>>>>>>> rank 5 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>>>> rank 16 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>>>> 26, 27, 28, 29, 30,
>>>>>>>> rank 17 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>>>> 26, 27, 28, 29, 30,
>>>>>>>> rank 29 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>>>> rank 30 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>>>> rank 18 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>>>> 26, 27, 28, 29, 30,
>>>>>>>> rank 19 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>>>> 26, 27, 28, 29, 30,
>>>>>>>> rank 31 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>>>> rank 20 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>>>> 26, 27, 28, 29, 30,
>>>>>>>> rank 22 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>>>> rank 21 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>>>> 26, 27, 28, 29, 30,
>>>>>>>> rank 23 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>>>> rank 24 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>>>> rank 25 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>>>> rank 26 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>>>> rank 27 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>>>> rank 28 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Still looks ok to me. If I now turn the binding on, openmpi fails:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>> processes than cpus on a resource:
>>>>>>>> 
>>>>>>>>   Bind to:     CORE
>>>>>>>>   Node:        c1-31
>>>>>>>>   #processes:  2
>>>>>>>>   #cpus:       1
>>>>>>>> 
>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>> option to your binding directive.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> 
>>>>>>>> The above tests were done with 1.10.1rc1, so it does not fix the 
>>>>>>>> problem.
>>>>>>>> 
>>>>>>>> Marcin
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> I’m wondering because bind-to core will attempt to bind your proc to 
>>>>>>>>> both HTs on the core. For some reason, we thought that 8.24 were HTs 
>>>>>>>>> on the same core, which is why we tried to bind to that pair of HTs. 
>>>>>>>>> We got an error because HT #24 was not allocated to us on node c6, 
>>>>>>>>> but HT #8 was.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Oct 3, 2015, at 2:43 AM, marcin.krotkiewski < 
>>>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>marcin.krotkiew...@gmail.com 
>>>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi, Ralph,
>>>>>>>>>> 
>>>>>>>>>> I submit my slurm job as follows
>>>>>>>>>> 
>>>>>>>>>> salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0
>>>>>>>>>> 
>>>>>>>>>> Effectively, the allocated CPU cores are spread amount many cluster 
>>>>>>>>>> nodes. SLURM uses cgroups to limit the CPU cores available for mpi 
>>>>>>>>>> processes running on a given cluster node. Compute nodes are 
>>>>>>>>>> 2-socket, 8-core E5-2670 systems with HyperThreading on
>>>>>>>>>> 
>>>>>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
>>>>>>>>>> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
>>>>>>>>>> node distances:
>>>>>>>>>> node   0   1
>>>>>>>>>>  0:  10  21
>>>>>>>>>>  1:  21  10
>>>>>>>>>> 
>>>>>>>>>> I run MPI program with command
>>>>>>>>>> 
>>>>>>>>>> mpirun  --report-bindings --bind-to core -np 64 ./affinity
>>>>>>>>>> 
>>>>>>>>>> The program simply runs sched_getaffinity for each process and 
>>>>>>>>>> prints out the result.
>>>>>>>>>> 
>>>>>>>>>> -----------
>>>>>>>>>> TEST RUN 1
>>>>>>>>>> -----------
>>>>>>>>>> For this particular job the problem is more severe: openmpi fails to 
>>>>>>>>>> run at all with error
>>>>>>>>>> 
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> Open MPI tried to bind a new process, but something went wrong.  The
>>>>>>>>>> process was killed without launching the target application.  Your 
>>>>>>>>>> job
>>>>>>>>>> will now abort.
>>>>>>>>>> 
>>>>>>>>>>  Local host:        c6-6
>>>>>>>>>>  Application name:  ./affinity
>>>>>>>>>>  Error message:     hwloc_set_cpubind returned "Error" for bitmap 
>>>>>>>>>> "8,24"
>>>>>>>>>>  Location:          odls_default_module.c:551
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> 
>>>>>>>>>> This is SLURM environment variables:
>>>>>>>>>> 
>>>>>>>>>> SLURM_JOBID=12712225
>>>>>>>>>> SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
>>>>>>>>>> SLURM_JOB_ID=12712225
>>>>>>>>>> SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
>>>>>>>>>> SLURM_JOB_NUM_NODES=24
>>>>>>>>>> SLURM_JOB_PARTITION=normal
>>>>>>>>>> SLURM_MEM_PER_CPU=2048
>>>>>>>>>> SLURM_NNODES=24
>>>>>>>>>> SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
>>>>>>>>>> SLURM_NODE_ALIASES='(null)'
>>>>>>>>>> SLURM_NPROCS=64
>>>>>>>>>> SLURM_NTASKS=64
>>>>>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>>>>>>>>>> SLURM_SUBMIT_HOST=login-0-2.local
>>>>>>>>>> SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
>>>>>>>>>> 
>>>>>>>>>> There is also a lot of warnings like
>>>>>>>>>> 
>>>>>>>>>> [compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all 
>>>>>>>>>> available processors)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -----------
>>>>>>>>>> TEST RUN 2
>>>>>>>>>> -----------
>>>>>>>>>> 
>>>>>>>>>> In another allocation I got a different error
>>>>>>>>>> 
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>>>> processes than cpus on a resource:
>>>>>>>>>> 
>>>>>>>>>>   Bind to:     CORE
>>>>>>>>>>   Node:        c6-19
>>>>>>>>>>   #processes:  2
>>>>>>>>>>   #cpus:       1
>>>>>>>>>> 
>>>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>>>> option to your binding directive.
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> 
>>>>>>>>>> and the allocation was the following
>>>>>>>>>> 
>>>>>>>>>> SLURM_JOBID=12712250
>>>>>>>>>> SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
>>>>>>>>>> SLURM_JOB_ID=12712250
>>>>>>>>>> SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
>>>>>>>>>> SLURM_JOB_NUM_NODES=15
>>>>>>>>>> SLURM_JOB_PARTITION=normal
>>>>>>>>>> SLURM_MEM_PER_CPU=2048
>>>>>>>>>> SLURM_NNODES=15
>>>>>>>>>> SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
>>>>>>>>>> SLURM_NODE_ALIASES='(null)'
>>>>>>>>>> SLURM_NPROCS=64
>>>>>>>>>> SLURM_NTASKS=64
>>>>>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>>>>>>>>>> SLURM_SUBMIT_HOST=login-0-2.local
>>>>>>>>>> SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> If in this case I run on only 32 cores
>>>>>>>>>> 
>>>>>>>>>> mpirun  --report-bindings --bind-to core -np 32 ./affinity
>>>>>>>>>> 
>>>>>>>>>> the process starts, but I get the original binding problem:
>>>>>>>>>> 
>>>>>>>>>> [compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all 
>>>>>>>>>> available processors)
>>>>>>>>>> 
>>>>>>>>>> Running with --hetero-nodes yields exactly the same results
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hope the above is useful. The problem with binding under SLURM with 
>>>>>>>>>> CPU cores spread over nodes seems to be very reproducible. It is 
>>>>>>>>>> actually very often that OpenMPI dies with some error like above. 
>>>>>>>>>> These tests were run with openmpi-1.8.8 and 1.10.0, both giving same 
>>>>>>>>>> results.
>>>>>>>>>> 
>>>>>>>>>> One more suggestion. The warning message (MCW rank 8 is not 
>>>>>>>>>> bound...) is ONLY displayed when I use --report-bindings. It is 
>>>>>>>>>> never shown if I leave out this option, and although the binding is 
>>>>>>>>>> wrong the user is not notified. I think it would be better to show 
>>>>>>>>>> this warning in all cases binding fails.
>>>>>>>>>> 
>>>>>>>>>> Let me know if you need more information. I can help to debug this - 
>>>>>>>>>> it is a rather crucial issue.
>>>>>>>>>> 
>>>>>>>>>> Thanks!
>>>>>>>>>> 
>>>>>>>>>> Marcin
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 10/02/2015 11:49 PM, Ralph Castain wrote:
>>>>>>>>>>> Can you please send me the allocation request you made (so I can 
>>>>>>>>>>> see what you specified on the cmd line), and the mpirun cmd line?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> Ralph
>>>>>>>>>>> 
>>>>>>>>>>>> On Oct 2, 2015, at 8:25 AM, Marcin Krotkiewski < 
>>>>>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>marcin.krotkiew...@gmail.com 
>>>>>>>>>>>> <mailto:marcin.krotkiew...@gmail.com>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> I fail to make OpenMPI bind to cores correctly when running from 
>>>>>>>>>>>> within SLURM-allocated CPU resources spread over a range of 
>>>>>>>>>>>> compute nodes in an otherwise homogeneous cluster. I have found 
>>>>>>>>>>>> this thread
>>>>>>>>>>>> 
>>>>>>>>>>>>  
>>>>>>>>>>>> <http://www.open-mpi.org/community/lists/users/2014/06/24682.php>http://www.open-mpi.org/community/lists/users/2014/06/24682.php
>>>>>>>>>>>>  <http://www.open-mpi.org/community/lists/users/2014/06/24682.php>
>>>>>>>>>>>> 
>>>>>>>>>>>> and did try to use what Ralph suggested there (--hetero-nodes), 
>>>>>>>>>>>> but it does not work (v. 1.10.0). When running with 
>>>>>>>>>>>> --report-bindings I get messages like
>>>>>>>>>>>> 
>>>>>>>>>>>> [compute-9-11.local:27571] MCW rank 10 is not bound (or bound to 
>>>>>>>>>>>> all available processors)
>>>>>>>>>>>> 
>>>>>>>>>>>> for all ranks outside of my first physical compute node. Moreover, 
>>>>>>>>>>>> everything works as expected if I ask SLURM to assign entire 
>>>>>>>>>>>> compute nodes. So it does look like Ralph's diagnose presented in 
>>>>>>>>>>>> that thread is correct, just the --hetero-nodes switch does not 
>>>>>>>>>>>> work for me.
>>>>>>>>>>>> 
>>>>>>>>>>>> I have written a short code that uses sched_getaffinity to print 
>>>>>>>>>>>> the effective bindings: all MPI ranks except of those on the first 
>>>>>>>>>>>> node are bound to all CPU cores allocated by SLURM.
>>>>>>>>>>>> 
>>>>>>>>>>>> Do I have to do something except of --hetero-nodes, or is this a 
>>>>>>>>>>>> problem that needs further investigation?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>> 
>>>>>>>>>>>> Marcin
>>>>>>>>>>>> 
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>  <mailto:us...@open-mpi.org>us...@open-mpi.org 
>>>>>>>>>>>> <mailto:us...@open-mpi.org>
>>>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>>>>>>>>>>>> <http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>>>>>>>>>>> Link to this post: 
>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/10/27770.php 
>>>>>>>>>>>> <http://www.open-mpi.org/community/lists/users/2015/10/27770.php>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> users mailing list
>>>>>>>>>>>  <mailto:us...@open-mpi.org>us...@open-mpi.org 
>>>>>>>>>>> <mailto:us...@open-mpi.org>
>>>>>>>>>>> Subscription:  
>>>>>>>>>>> <http://www.open-mpi.org/mailman/listinfo.cgi/users>http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>>  <http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>>>>>>>>>> Link to this post: 
>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/10/27774.php 
>>>>>>>>>>> <http://www.open-mpi.org/community/lists/users/2015/10/27774.php>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>>  <mailto:us...@open-mpi.org>us...@open-mpi.org 
>>>>>>>>>> <mailto:us...@open-mpi.org>
>>>>>>>>>> Subscription:  
>>>>>>>>>> <http://www.open-mpi.org/mailman/listinfo.cgi/users>http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>>  <http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>>>>>>>>> Link to this post: 
>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/10/27776.php 
>>>>>>>>>> <http://www.open-mpi.org/community/lists/users/2015/10/27776.php>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>>  <mailto:us...@open-mpi.org>us...@open-mpi.org 
>>>>>>>>> <mailto:us...@open-mpi.org>
>>>>>>>>> Subscription:  
>>>>>>>>> <http://www.open-mpi.org/mailman/listinfo.cgi/users>http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>  <http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>>>>>>>> Link to this post: 
>>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/10/27778.php 
>>>>>>>>> <http://www.open-mpi.org/community/lists/users/2015/10/27778.php>
>>>>>>>> 
>>>>>>>> <affinity.c>_______________________________________________
>>>>>>>> users mailing list
>>>>>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>>>>>>>> <http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>>>>>>> Link to this post:  
>>>>>>>> <http://www.open-mpi.org/community/lists/users/2015/10/27781.php>http://www.open-mpi.org/community/lists/users/2015/10/27781.php
>>>>>>>>  <http://www.open-mpi.org/community/lists/users/2015/10/27781.php>
>>>>>>> 
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>>>>>>> <http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>>>>>> Link to this post: 
>>>>>>> http://www.open-mpi.org/community/lists/users/2015/10/27782.php 
>>>>>>> <http://www.open-mpi.org/community/lists/users/2015/10/27782.php>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>>>>> Subscription:  
>>>>>> <http://www.open-mpi.org/mailman/listinfo.cgi/users>http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>  <http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>>>>> Link to this post:  
>>>>>> <http://www.open-mpi.org/community/lists/users/2015/10/27783.php>http://www.open-mpi.org/community/lists/users/2015/10/27783.php
>>>>>>  <http://www.open-mpi.org/community/lists/users/2015/10/27783.php>
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>>>>> <http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>>>> Link to this post: 
>>>>> http://www.open-mpi.org/community/lists/users/2015/10/27784.php 
>>>>> <http://www.open-mpi.org/community/lists/users/2015/10/27784.php>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>>>> <http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>>> Link to this post: 
>>>> http://www.open-mpi.org/community/lists/users/2015/10/27785.php 
>>>> <http://www.open-mpi.org/community/lists/users/2015/10/27785.php>
>> 
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>> <http://www.open-mpi.org/mailman/listinfo.cgi/users>
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2015/10/27787.php 
>> <http://www.open-mpi.org/community/lists/users/2015/10/27787.php>
> <out.1.10.1rc1.overload><out.1.10.1rc1><out.1.10.0>_______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/10/27788.php

Re: [OMPI users] Process binding with SLURM and 'heterogeneous' nodes

Reply via email to