What version of slurm is this? I might try to debug it here. I’m not sure where the problem lies just yet.
> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>
> Here is the output of lstopo. In short, (0,16) are core 0, (1,17) are core 1, etc.
>
> Machine (64GB)
>   NUMANode L#0 (P#0 32GB)
>     Socket L#0 + L3 L#0 (20MB)
>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>         PU L#0 (P#0)
>         PU L#1 (P#16)
>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>         PU L#2 (P#1)
>         PU L#3 (P#17)
>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>         PU L#4 (P#2)
>         PU L#5 (P#18)
>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>         PU L#6 (P#3)
>         PU L#7 (P#19)
>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>         PU L#8 (P#4)
>         PU L#9 (P#20)
>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>         PU L#10 (P#5)
>         PU L#11 (P#21)
>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>         PU L#12 (P#6)
>         PU L#13 (P#22)
>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>         PU L#14 (P#7)
>         PU L#15 (P#23)
>     HostBridge L#0
>       PCIBridge
>         PCI 8086:1521
>           Net L#0 "eth0"
>         PCI 8086:1521
>           Net L#1 "eth1"
>       PCIBridge
>         PCI 15b3:1003
>           Net L#2 "ib0"
>           OpenFabrics L#3 "mlx4_0"
>       PCIBridge
>         PCI 102b:0532
>       PCI 8086:1d02
>         Block L#4 "sda"
>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>       PU L#16 (P#8)
>       PU L#17 (P#24)
>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>       PU L#18 (P#9)
>       PU L#19 (P#25)
>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>       PU L#20 (P#10)
>       PU L#21 (P#26)
>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>       PU L#22 (P#11)
>       PU L#23 (P#27)
>     L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>       PU L#24 (P#12)
>       PU L#25 (P#28)
>     L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>       PU L#26 (P#13)
>       PU L#27 (P#29)
>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>       PU L#28 (P#14)
>       PU L#29 (P#30)
>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>       PU L#30 (P#15)
>       PU L#31 (P#31)
>
>
> On 10/03/2015 05:46 PM, Ralph Castain wrote:
>> Maybe I'm just misreading your HT map - that slurm nodelist syntax is a new one to me, but they tend to change things around. Could you run lstopo on one of those compute nodes and send the output?
>>
>> I'm just suspicious because I'm not seeing a clear pairing of HT numbers in your output, but HT numbering is BIOS-specific and I may just not be understanding your particular pattern. Our error message is clearly indicating that we are seeing individual HTs (and not complete cores) assigned, and I don't know the source of that confusion.
>>
>>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>
>>> On 10/03/2015 04:38 PM, Ralph Castain wrote:
>>>> If mpirun isn't trying to do any binding, then you will of course get the right mapping as we'll just inherit whatever we received.
>>> Yes. I meant that whatever you received (what SLURM gives) is a correct cpu map and assigns _whole_ CPUs, not single HTs, to MPI processes. In the case mentioned earlier openmpi should start 6 tasks on c1-30. If HTs were treated as separate and independent cores, sched_getaffinity of an MPI process started on c1-30 would return a map with only 6 entries. In my case it returns a map with 12 entries - 2 for each core. So each process is in fact allocated both HTs, not only one. Is what I'm saying correct?
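To double-check which OS processor numbers (P#) are hardware threads of the same core on a given node, one can also walk the hwloc topology directly instead of reading lstopo output by eye. A minimal sketch (not part of this thread's test code; it assumes hwloc headers and library are installed, built e.g. with gcc list_siblings.c -lhwloc):

/* list_siblings.c: for every core, print the OS indices of its PUs
 * (hardware threads), e.g. "Core L#0: PUs 0 16" on the nodes above. */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    int i, ncores;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        unsigned pu;
        printf("Core L#%u: PUs", core->logical_index);
        /* iterate over the OS indices of all PUs contained in this core */
        hwloc_bitmap_foreach_begin(pu, core->cpuset)
            printf(" %u", pu);
        hwloc_bitmap_foreach_end();
        printf("\n");
    }

    hwloc_topology_destroy(topo);
    return 0;
}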
>>>
>>>> Looking at your output, it's pretty clear that you are getting independent HTs assigned and not full cores.
>>> How do you mean? Is the above understanding wrong? I would expect that on c1-30 with --bind-to core openmpi should bind to logical cores 0 and 16 (rank 0), 1 and 17 (rank 1), and so on. All those logical cores are available in the sched_getaffinity map, and there are twice as many logical cores as there are MPI processes started on the node.
>>>
>>>> My guess is that something in slurm has changed such that it detects that HT has been enabled, and then begins treating the HTs as completely independent cpus.
>>>>
>>>> Try changing "-bind-to core" to "-bind-to hwthread -use-hwthread-cpus" and see if that works
>>>>
>>> I have, and the binding is wrong. For example, I got this output
>>>
>>> rank 0 @ compute-1-30.local 0,
>>> rank 1 @ compute-1-30.local 16,
>>>
>>> which means that two ranks have been bound to the same physical core (logical cores 0 and 16 are two HTs of the same core). If I use --bind-to core, I get the following correct binding
>>>
>>> rank 0 @ compute-1-30.local 0, 16,
>>>
>>> The problem is that many other ranks get a bad binding, with a 'rank XXX is not bound (or bound to all available processors)' warning.
>>>
>>> But I think I was not entirely correct saying that 1.10.1rc1 did not fix things. It still might have improved something, but not everything. Consider this job:
>>>
>>> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
>>> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
>>>
>>> If I run 32 tasks as follows (with 1.10.1rc1)
>>>
>>> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
>>>
>>> I get the following error:
>>>
>>> --------------------------------------------------------------------------
>>> A request was made to bind to that would result in binding more
>>> processes than cpus on a resource:
>>>
>>>    Bind to:     CORE
>>>    Node:        c9-31
>>>    #processes:  2
>>>    #cpus:       1
>>>
>>> You can override this protection by adding the "overload-allowed"
>>> option to your binding directive.
>>> --------------------------------------------------------------------------
>>>
>>> If I now use --bind-to core:overload-allowed, then openmpi starts and _most_ of the threads are bound correctly (i.e., the map contains two logical cores in ALL cases), except in this case that required the overload flag:
>>>
>>> rank 15 @ compute-9-31.local 1, 17,
>>> rank 16 @ compute-9-31.local 11, 27,
>>> rank 17 @ compute-9-31.local 2, 18,
>>> rank 18 @ compute-9-31.local 12, 28,
>>> rank 19 @ compute-9-31.local 1, 17,
>>>
>>> Note that the pair 1,17 is used twice. The original SLURM-delivered map (no binding) on this node is
>>>
>>> rank 15 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>> rank 16 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>> rank 17 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>> rank 18 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>> rank 19 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>
>>> Why does openmpi use cores (1,17) twice instead of using core (13,29)? Clearly, the original SLURM-delivered map has 5 CPUs included, enough for 5 MPI processes.
>>>
>>> Cheers,
>>>
>>> Marcin
>>>
>>>>
>>>>> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>>
>>>>> On 10/03/2015 01:06 PM, Ralph Castain wrote:
>>>>>> Thanks Marcin. Looking at this, I'm guessing that Slurm may be treating HTs as "cores" - i.e., as independent cpus. Any chance that is true?
>>>>> Not to the best of my knowledge, and at least not intentionally. SLURM starts as many processes as there are physical cores, not threads. To verify this, consider this test case:
>>>>>
>>>>> SLURM_JOB_CPUS_PER_NODE='6,8(x2),10'
>>>>> SLURM_JOB_NODELIST='c1-[30-31],c2-[32,34]'
>>>>>
>>>>> If I now execute only one mpi process WITH NO BINDING, it will go onto c1-30 and should have a map with 6 CPUs (12 hw threads). I run
>>>>>
>>>>> mpirun --bind-to none -np 1 ./affinity
>>>>> rank 0 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>
>>>>> I have attached the affinity.c program FYI. Clearly, sched_getaffinity in my test code returns the correct map.
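The attachment itself is not reproduced in the archive. For readers following along, a minimal affinity-reporting program of this kind (a sketch only, not necessarily identical to the attached affinity.c) could look like this; build with mpicc:

/* affinity sketch: each rank prints the CPUs in its sched_getaffinity mask,
 * one line per rank, in the "rank N @ host cpu, cpu, ..." format seen in
 * this thread. A reconstruction for illustration, not the original attachment. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, cpu;
    char host[256], buf[8192] = "";
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    /* query the affinity mask of the calling process (pid 0 == self) */
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);

    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &mask)) {
            char tmp[16];
            snprintf(tmp, sizeof(tmp), "%d, ", cpu);
            strcat(buf, tmp);
        }
    }
    printf("rank %d @ %s %s\n", rank, host, buf);

    MPI_Finalize();
    return 0;
}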
>>>>>
>>>>> Now if I try to start all 32 processes in this example (still no binding):
>>>>>
>>>>> rank 0 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>> rank 1 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>> rank 10 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
>>>>> rank 11 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
>>>>> rank 12 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
>>>>> rank 13 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
>>>>> rank 6 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
>>>>> rank 2 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>> rank 7 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
>>>>> rank 8 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
>>>>> rank 3 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>> rank 14 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
>>>>> rank 4 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>> rank 15 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
>>>>> rank 9 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
>>>>> rank 5 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>> rank 16 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
>>>>> rank 17 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
>>>>> rank 29 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
>>>>> rank 30 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
>>>>> rank 18 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
>>>>> rank 19 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
>>>>> rank 31 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
>>>>> rank 20 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
>>>>> rank 22 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
>>>>> rank 21 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
>>>>> rank 23 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
>>>>> rank 24 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
>>>>> rank 25 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
>>>>> rank 26 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
>>>>> rank 27 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
>>>>> rank 28 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>
>>>>> Still looks ok to me. If I now turn the binding on, openmpi fails:
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> A request was made to bind to that would result in binding more
>>>>> processes than cpus on a resource:
>>>>>
>>>>>    Bind to:     CORE
>>>>>    Node:        c1-31
>>>>>    #processes:  2
>>>>>    #cpus:       1
>>>>>
>>>>> You can override this protection by adding the "overload-allowed"
>>>>> option to your binding directive.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> The above tests were done with 1.10.1rc1, so it does not fix the problem.
>>>>>
>>>>> Marcin
>>>>>
>>>>>> I'm wondering because bind-to core will attempt to bind your proc to both HTs on the core. For some reason, we thought that 8,24 were HTs on the same core, which is why we tried to bind to that pair of HTs. We got an error because HT #24 was not allocated to us on node c6, but HT #8 was.
>>>>>>
>>>>>>> On Oct 3, 2015, at 2:43 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi, Ralph,
>>>>>>>
>>>>>>> I submit my slurm job as follows
>>>>>>>
>>>>>>> salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0
>>>>>>>
>>>>>>> Effectively, the allocated CPU cores are spread among many cluster nodes. SLURM uses cgroups to limit the CPU cores available for mpi processes running on a given cluster node. Compute nodes are 2-socket, 8-core E5-2670 systems with HyperThreading on:
>>>>>>>
>>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
>>>>>>> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
>>>>>>> node distances:
>>>>>>> node   0   1
>>>>>>>   0:  10  21
>>>>>>>   1:  21  10
>>>>>>>
>>>>>>> I run the MPI program with the command
>>>>>>>
>>>>>>> mpirun --report-bindings --bind-to core -np 64 ./affinity
>>>>>>>
>>>>>>> The program simply runs sched_getaffinity for each process and prints out the result.
>>>>>>>
>>>>>>> -----------
>>>>>>> TEST RUN 1
>>>>>>> -----------
>>>>>>> For this particular job the problem is more severe: openmpi fails to run at all with the error
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> Open MPI tried to bind a new process, but something went wrong. The
>>>>>>> process was killed without launching the target application. Your job
>>>>>>> will now abort.
>>>>>>>
>>>>>>>   Local host:        c6-6
>>>>>>>   Application name:  ./affinity
>>>>>>>   Error message:     hwloc_set_cpubind returned "Error" for bitmap "8,24"
>>>>>>>   Location:          odls_default_module.c:551
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> These are the SLURM environment variables:
>>>>>>>
>>>>>>> SLURM_JOBID=12712225
>>>>>>> SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
>>>>>>> SLURM_JOB_ID=12712225
>>>>>>> SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
>>>>>>> SLURM_JOB_NUM_NODES=24
>>>>>>> SLURM_JOB_PARTITION=normal
>>>>>>> SLURM_MEM_PER_CPU=2048
>>>>>>> SLURM_NNODES=24
>>>>>>> SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
>>>>>>> SLURM_NODE_ALIASES='(null)'
>>>>>>> SLURM_NPROCS=64
>>>>>>> SLURM_NTASKS=64
>>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>>>>>>> SLURM_SUBMIT_HOST=login-0-2.local
>>>>>>> SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
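As an aside, the compressed SLURM_JOB_CPUS_PER_NODE / SLURM_TASKS_PER_NODE syntax above expands to one CPU count per node of SLURM_JOB_NODELIST, with N(xR) meaning the count N repeated for R consecutive nodes. A small illustrative sketch of that expansion (a hypothetical helper, not part of this thread's code):

/* expand_counts.c: expand SLURM's compressed per-node count syntax,
 * e.g. "6,8(x2),10" -> "6 8 8 10" (one entry per node, in nodelist order).
 * Illustrative sketch only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *spec = getenv("SLURM_JOB_CPUS_PER_NODE");
    char *copy, *tok;

    if (!spec)
        spec = "6,8(x2),10";   /* example value taken from this thread */

    copy = strdup(spec);
    for (tok = strtok(copy, ","); tok; tok = strtok(NULL, ",")) {
        int count = 0, repeat = 1, i;
        /* each token is either "N" or "N(xR)" */
        if (sscanf(tok, "%d(x%d)", &count, &repeat) < 1)
            continue;
        for (i = 0; i < repeat; i++)
            printf("%d ", count);
    }
    printf("\n");
    free(copy);
    return 0;
}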
>>>>>>>
>>>>>>> There are also a lot of warnings like
>>>>>>>
>>>>>>> [compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all available processors)
>>>>>>>
>>>>>>> -----------
>>>>>>> TEST RUN 2
>>>>>>> -----------
>>>>>>>
>>>>>>> In another allocation I got a different error
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> A request was made to bind to that would result in binding more
>>>>>>> processes than cpus on a resource:
>>>>>>>
>>>>>>>    Bind to:     CORE
>>>>>>>    Node:        c6-19
>>>>>>>    #processes:  2
>>>>>>>    #cpus:       1
>>>>>>>
>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>> option to your binding directive.
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> and the allocation was the following
>>>>>>>
>>>>>>> SLURM_JOBID=12712250
>>>>>>> SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
>>>>>>> SLURM_JOB_ID=12712250
>>>>>>> SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
>>>>>>> SLURM_JOB_NUM_NODES=15
>>>>>>> SLURM_JOB_PARTITION=normal
>>>>>>> SLURM_MEM_PER_CPU=2048
>>>>>>> SLURM_NNODES=15
>>>>>>> SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
>>>>>>> SLURM_NODE_ALIASES='(null)'
>>>>>>> SLURM_NPROCS=64
>>>>>>> SLURM_NTASKS=64
>>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>>>>>>> SLURM_SUBMIT_HOST=login-0-2.local
>>>>>>> SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
>>>>>>>
>>>>>>> If in this case I run on only 32 cores
>>>>>>>
>>>>>>> mpirun --report-bindings --bind-to core -np 32 ./affinity
>>>>>>>
>>>>>>> the process starts, but I get the original binding problem:
>>>>>>>
>>>>>>> [compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all available processors)
>>>>>>>
>>>>>>> Running with --hetero-nodes yields exactly the same results.
>>>>>>>
>>>>>>> Hope the above is useful. The problem with binding under SLURM with CPU cores spread over nodes seems to be very reproducible; OpenMPI very often dies with an error like the ones above. These tests were run with openmpi-1.8.8 and 1.10.0, both giving the same results.
>>>>>>>
>>>>>>> One more suggestion. The warning message (MCW rank 8 is not bound...) is ONLY displayed when I use --report-bindings. It is never shown if I leave out this option, and although the binding is wrong the user is not notified. I think it would be better to show this warning in all cases where binding fails.
>>>>>>>
>>>>>>> Let me know if you need more information. I can help to debug this - it is a rather crucial issue.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Marcin
>>>>>>>
>>>>>>> On 10/02/2015 11:49 PM, Ralph Castain wrote:
>>>>>>>> Can you please send me the allocation request you made (so I can see what you specified on the cmd line), and the mpirun cmd line?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>> On Oct 2, 2015, at 8:25 AM, Marcin Krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I fail to make OpenMPI bind to cores correctly when running from within SLURM-allocated CPU resources spread over a range of compute nodes in an otherwise homogeneous cluster. I have found this thread
>>>>>>>>>
>>>>>>>>> http://www.open-mpi.org/community/lists/users/2014/06/24682.php
>>>>>>>>>
>>>>>>>>> and did try to use what Ralph suggested there (--hetero-nodes), but it does not work (v. 1.10.0). When running with --report-bindings I get messages like
>>>>>>>>>
>>>>>>>>> [compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all available processors)
>>>>>>>>>
>>>>>>>>> for all ranks outside of my first physical compute node. Moreover, everything works as expected if I ask SLURM to assign entire compute nodes. So it does look like Ralph's diagnosis presented in that thread is correct, just the --hetero-nodes switch does not work for me.
>>>>>>>>>
>>>>>>>>> I have written a short code that uses sched_getaffinity to print the effective bindings: all MPI ranks except those on the first node are bound to all CPU cores allocated by SLURM.
>>>>>>>>>
>>>>>>>>> Do I have to do something besides --hetero-nodes, or is this a problem that needs further investigation?
>>>>>>>>>
>>>>>>>>> Thanks a lot!
>>>>>>>>>
>>>>>>>>> Marcin