Rats - just realized I have no way to test this, as none of the machines I can 
access are set up for cgroup-based multi-tenancy. Is this a debug version of 
OMPI? If not, can you rebuild OMPI with --enable-debug?

Then please run it with --mca rmaps_base_verbose 10 and pass along the output.

Thanks
Ralph


> On Oct 3, 2015, at 10:09 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> What version of slurm is this? I might try to debug it here. I’m not sure 
> where the problem lies just yet.
> 
> 
>> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>> 
>> Here is the output of lstopo. In short, PUs (0,16) form core 0, (1,17) form 
>> core 1, etc.
>> 
>> Machine (64GB)
>>   NUMANode L#0 (P#0 32GB)
>>     Socket L#0 + L3 L#0 (20MB)
>>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>>         PU L#0 (P#0)
>>         PU L#1 (P#16)
>>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>>         PU L#2 (P#1)
>>         PU L#3 (P#17)
>>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>>         PU L#4 (P#2)
>>         PU L#5 (P#18)
>>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>>         PU L#6 (P#3)
>>         PU L#7 (P#19)
>>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>>         PU L#8 (P#4)
>>         PU L#9 (P#20)
>>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>>         PU L#10 (P#5)
>>         PU L#11 (P#21)
>>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>>         PU L#12 (P#6)
>>         PU L#13 (P#22)
>>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>>         PU L#14 (P#7)
>>         PU L#15 (P#23)
>>     HostBridge L#0
>>       PCIBridge
>>         PCI 8086:1521
>>           Net L#0 "eth0"
>>         PCI 8086:1521
>>           Net L#1 "eth1"
>>       PCIBridge
>>         PCI 15b3:1003
>>           Net L#2 "ib0"
>>           OpenFabrics L#3 "mlx4_0"
>>       PCIBridge
>>         PCI 102b:0532
>>       PCI 8086:1d02
>>         Block L#4 "sda"
>>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>>       PU L#16 (P#8)
>>       PU L#17 (P#24)
>>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>>       PU L#18 (P#9)
>>       PU L#19 (P#25)
>>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>>       PU L#20 (P#10)
>>       PU L#21 (P#26)
>>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>>       PU L#22 (P#11)
>>       PU L#23 (P#27)
>>     L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>>       PU L#24 (P#12)
>>       PU L#25 (P#28)
>>     L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>>       PU L#26 (P#13)
>>       PU L#27 (P#29)
>>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>>       PU L#28 (P#14)
>>       PU L#29 (P#30)
>>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>>       PU L#30 (P#15)
>>       PU L#31 (P#31)
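
For completeness, the same core-to-PU pairing can be read programmatically with
hwloc. The sketch below is an illustration only (it is not part of this thread's
attached code and assumes the hwloc 1.x C API that OMPI 1.8/1.10 ships with); it
prints, for each core, the OS indices (P#) of its hardware threads:

/* Illustration: list each core's hardware threads (PUs) by OS index,
 * reproducing the (0,16), (1,17), ... pairing shown by lstopo above.
 * Assumes hwloc 1.x; link with -lhwloc. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    unsigned pu;
    int i, ncores;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        printf("Core L#%u: PUs", core->logical_index);
        /* core->cpuset holds the OS indices of the PUs belonging to this core */
        hwloc_bitmap_foreach_begin(pu, core->cpuset)
            printf(" P#%u", pu);
        hwloc_bitmap_foreach_end();
        printf("\n");
    }

    hwloc_topology_destroy(topo);
    return 0;
}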
>> 
>> 
>> 
>> On 10/03/2015 05:46 PM, Ralph Castain wrote:
>>> Maybe I’m just misreading your HT map - that slurm nodelist syntax is a new 
>>> one to me, but they tend to change things around. Could you run lstopo on 
>>> one of those compute nodes and send the output?
>>> 
>>> I’m just suspicious because I’m not seeing a clear pairing of HT numbers in 
>>> your output, but HT numbering is BIOS-specific and I may just not be 
>>> understanding your particular pattern. Our error message is clearly 
>>> indicating that we are seeing individual HTs (and not complete cores) 
>>> assigned, and I don’t know the source of that confusion.
>>> 
>>> 
>>>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>> 
>>>> 
>>>> On 10/03/2015 04:38 PM, Ralph Castain wrote:
>>>>> If mpirun isn’t trying to do any binding, then you will of course get the 
>>>>> right mapping as we’ll just inherit whatever we received.
>>>> Yes. I meant that whatever you received (what SLURM gives) is a correct CPU 
>>>> map that assigns _whole_ CPUs, not single HTs, to MPI processes. In the case 
>>>> mentioned earlier, openmpi should start 6 tasks on c1-30. If HTs were treated 
>>>> as separate and independent cores, sched_getaffinity of an MPI process 
>>>> started on c1-30 would return a map with only 6 entries. In my case it 
>>>> returns a map with 12 entries - 2 for each core. So each process is in fact 
>>>> allocated both HTs, not only one. Is what I'm saying correct?
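
A minimal sketch of that entry-count check (an illustration, not the attached
affinity.c): count the hardware threads in the mask that sched_getaffinity
returns and compare against the number of ranks SLURM placed on the node.

/* Illustration: count the hardware threads the calling process may run on.
 * With whole-core allocation and HT enabled this should be 2x the number of
 * cores granted on the node; with per-HT allocation it would be 1x. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return 1;
    printf("allowed hw threads: %d\n", CPU_COUNT(&mask));
    return 0;
}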
>>>> 
>>>>> Looking at your output, it’s pretty clear that you are getting 
>>>>> independent HTs assigned and not full cores. 
>>>> How do you mean? Is the above understanding wrong? I would expect that on 
>>>> c1-30 with --bind-to core openmpi should bind to logical cores 0 and 16 
>>>> (rank 0), 1 and 17 (rank 1), and so on. All those logical cores are 
>>>> available in the sched_getaffinity map, and there are twice as many logical 
>>>> cores as there are MPI processes started on the node.
>>>> 
>>>>> My guess is that something in slurm has changed such that it detects that 
>>>>> HT has been enabled, and then begins treating the HTs as completely 
>>>>> independent cpus.
>>>>> 
>>>>> Try changing “-bind-to core” to “-bind-to hwthread  -use-hwthread-cpus” 
>>>>> and see if that works
>>>>> 
>>>> I have, and the binding is wrong. For example, I got this output:
>>>> 
>>>> rank 0 @ compute-1-30.local  0,
>>>> rank 1 @ compute-1-30.local  16,
>>>> 
>>>> This means that two ranks have been bound to the same physical core 
>>>> (logical cores 0 and 16 are two HTs of the same core). If I use --bind-to 
>>>> core, I get the following correct binding:
>>>> 
>>>> rank 0 @ compute-1-30.local  0, 16,
>>>> 
>>>> The problem is that many other ranks get a bad binding, with a 'rank XXX is 
>>>> not bound (or bound to all available processors)' warning.
>>>> 
>>>> But I think I was not entirely correct saying that 1.10.1rc1 did not fix 
>>>> things. It still might have improved something, but not everything. 
>>>> Consider this job:
>>>> 
>>>> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
>>>> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
>>>> 
>>>> If I run 32 tasks as follows (with 1.10.1rc1)
>>>> 
>>>> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
>>>> 
>>>> I get the following error:
>>>> 
>>>> --------------------------------------------------------------------------
>>>> A request was made to bind to that would result in binding more
>>>> processes than cpus on a resource:
>>>> 
>>>>    Bind to:     CORE
>>>>    Node:        c9-31
>>>>    #processes:  2
>>>>    #cpus:       1
>>>> 
>>>> You can override this protection by adding the "overload-allowed"
>>>> option to your binding directive.
>>>> --------------------------------------------------------------------------
>>>> 
>>>> 
>>>> If I now use --bind-to core:overload-allowed, then openmpi starts and _most_ 
>>>> of the ranks are bound correctly (i.e., the map contains two logical cores 
>>>> in ALL cases), except on the node that required the overload flag:
>>>> 
>>>> rank 15 @ compute-9-31.local   1, 17,
>>>> rank 16 @ compute-9-31.local  11, 27,
>>>> rank 17 @ compute-9-31.local   2, 18, 
>>>> rank 18 @ compute-9-31.local  12, 28,
>>>> rank 19 @ compute-9-31.local   1, 17,
>>>> 
>>>> Note that the pair (1,17) is used twice. The original SLURM-delivered map 
>>>> (no binding) on this node is:
>>>> 
>>>> rank 15 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29, 
>>>> rank 16 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>> rank 17 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>> rank 18 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>> rank 19 @ compute-9-31.local  1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>> 
>>>> Why does openmpi use core (1,17) twice instead of using core (13,29)? 
>>>> Clearly, the original SLURM-delivered map includes 5 physical cores, enough 
>>>> for 5 MPI processes.
>>>> 
>>>> Cheers,
>>>> 
>>>> Marcin
>>>> 
>>>> 
>>>>> 
>>>>>> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> On 10/03/2015 01:06 PM, Ralph Castain wrote:
>>>>>>> Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating 
>>>>>>> HTs as “cores” - i.e., as independent cpus. Any chance that is true?
>>>>>> Not to the best of my knowledge, and at least not intentionally. SLURM 
>>>>>> counts physical cores, not hardware threads, when allocating CPUs. To 
>>>>>> verify this, consider this test case:
>>>>>> 
>>>>>> SLURM_JOB_CPUS_PER_NODE='6,8(x2),10'
>>>>>> SLURM_JOB_NODELIST='c1-[30-31],c2-[32,34]'
>>>>>> 
>>>>>> If I now execute only one MPI process WITH NO BINDING, it will go onto 
>>>>>> c1-30 and should have a map with 6 CPUs (12 hw threads). I run
>>>>>> 
>>>>>> mpirun --bind-to none -np 1 ./affinity
>>>>>> rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>> 
>>>>>> I have attached the affinity.c program FYI. Clearly, sched_getaffinity 
>>>>>> in my test code returns the correct map.
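
For reference, a minimal sketch of such an affinity reporter (an approximation;
the attached affinity.c itself is not reproduced in this archive):

/* Approximation of the attached affinity.c: each rank prints its hostname and
 * the OS indices of the hardware threads in its sched_getaffinity mask. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, cpu;
    char host[256];
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = the calling process */

    printf("rank %d @ %s  ", rank, host);
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf("%d, ", cpu);
    printf("\n");

    MPI_Finalize();
    return 0;
}

Compiled with mpicc, a program like this prints the same "rank N @ host <cpu
list>" lines quoted throughout this thread.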
>>>>>> 
>>>>>> Now if I try to start all 32 processes in this example (still no 
>>>>>> binding):
>>>>>> 
>>>>>> rank 0 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>> rank 1 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>> rank 10 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>> 27, 28, 29, 30, 31,
>>>>>> rank 11 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>> 27, 28, 29, 30, 31,
>>>>>> rank 12 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>> 27, 28, 29, 30, 31,
>>>>>> rank 13 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>> 27, 28, 29, 30, 31,
>>>>>> rank 6 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>> 27, 28, 29, 30, 31,
>>>>>> rank 2 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>> rank 7 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>> 27, 28, 29, 30, 31,
>>>>>> rank 8 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>> 27, 28, 29, 30, 31,
>>>>>> rank 3 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>> rank 14 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>> 26, 27, 28, 29, 30,
>>>>>> rank 4 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>> rank 15 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>> 26, 27, 28, 29, 30,
>>>>>> rank 9 @ compute-1-31.local  2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 
>>>>>> 27, 28, 29, 30, 31,
>>>>>> rank 5 @ compute-1-30.local  0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
>>>>>> rank 16 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>> 26, 27, 28, 29, 30,
>>>>>> rank 17 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>> 26, 27, 28, 29, 30,
>>>>>> rank 29 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>> rank 30 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>> rank 18 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>> 26, 27, 28, 29, 30,
>>>>>> rank 19 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>> 26, 27, 28, 29, 30,
>>>>>> rank 31 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>> rank 20 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>> 26, 27, 28, 29, 30,
>>>>>> rank 22 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>> rank 21 @ compute-2-32.local  7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 
>>>>>> 26, 27, 28, 29, 30,
>>>>>> rank 23 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>> rank 24 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>> rank 25 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>> rank 26 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>> rank 27 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>> rank 28 @ compute-2-34.local  0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 
>>>>>> 18, 19, 20, 21, 22, 23, 30, 31,
>>>>>> 
>>>>>> 
>>>>>> Still looks ok to me. If I now turn the binding on, openmpi fails:
>>>>>> 
>>>>>> 
>>>>>> --------------------------------------------------------------------------
>>>>>> A request was made to bind to that would result in binding more
>>>>>> processes than cpus on a resource:
>>>>>> 
>>>>>>   Bind to:     CORE
>>>>>>   Node:        c1-31
>>>>>>   #processes:  2
>>>>>>   #cpus:       1
>>>>>> 
>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>> option to your binding directive.
>>>>>> --------------------------------------------------------------------------
>>>>>> 
>>>>>> The above tests were done with 1.10.1rc1, so it does not fix the problem.
>>>>>> 
>>>>>> Marcin
>>>>>> 
>>>>>> 
>>>>>>> I’m wondering because bind-to core will attempt to bind your proc to 
>>>>>>> both HTs on the core. For some reason, we thought that 8,24 were HTs on 
>>>>>>> the same core, which is why we tried to bind to that pair of HTs. We 
>>>>>>> got an error because HT #24 was not allocated to us on node c6, but HT 
>>>>>>> #8 was.
>>>>>>> 
>>>>>>> 
>>>>>>>> On Oct 3, 2015, at 2:43 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi, Ralph,
>>>>>>>> 
>>>>>>>> I submit my slurm job as follows
>>>>>>>> 
>>>>>>>> salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0
>>>>>>>> 
>>>>>>>> Effectively, the allocated CPU cores are spread among many cluster 
>>>>>>>> nodes. SLURM uses cgroups to limit the CPU cores available to MPI 
>>>>>>>> processes running on a given cluster node. Compute nodes are 2-socket, 
>>>>>>>> 8-core E5-2670 systems with HyperThreading enabled:
>>>>>>>> 
>>>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
>>>>>>>> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
>>>>>>>> node distances:
>>>>>>>> node   0   1
>>>>>>>>  0:  10  21
>>>>>>>>  1:  21  10
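
As a side note, the cpuset imposed by the cgroup can be inspected from inside
the allocation without MPI at all: the kernel reports it as Cpus_allowed_list
in /proc/self/status. A small sketch (an illustration, not code from this
thread):

/* Illustration: print the kernel's view of the CPUs this process may run on,
 * which reflects the cgroup cpuset that SLURM set up for the job step. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[512];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "Cpus_allowed_list:", 18) == 0)
            fputs(line, stdout);
    fclose(f);
    return 0;
}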
>>>>>>>> 
>>>>>>>> I run the MPI program with the command
>>>>>>>> 
>>>>>>>> mpirun  --report-bindings --bind-to core -np 64 ./affinity
>>>>>>>> 
>>>>>>>> The program simply runs sched_getaffinity for each process and prints 
>>>>>>>> out the result.
>>>>>>>> 
>>>>>>>> -----------
>>>>>>>> TEST RUN 1
>>>>>>>> -----------
>>>>>>>> For this particular job the problem is more severe: openmpi fails to 
>>>>>>>> run at all with error
>>>>>>>> 
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> Open MPI tried to bind a new process, but something went wrong.  The
>>>>>>>> process was killed without launching the target application.  Your job
>>>>>>>> will now abort.
>>>>>>>> 
>>>>>>>>  Local host:        c6-6
>>>>>>>>  Application name:  ./affinity
>>>>>>>>  Error message:     hwloc_set_cpubind returned "Error" for bitmap 
>>>>>>>> "8,24"
>>>>>>>>  Location:          odls_default_module.c:551
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> 
>>>>>>>> These are the SLURM environment variables:
>>>>>>>> 
>>>>>>>> SLURM_JOBID=12712225
>>>>>>>> SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
>>>>>>>> SLURM_JOB_ID=12712225
>>>>>>>> SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
>>>>>>>> SLURM_JOB_NUM_NODES=24
>>>>>>>> SLURM_JOB_PARTITION=normal
>>>>>>>> SLURM_MEM_PER_CPU=2048
>>>>>>>> SLURM_NNODES=24
>>>>>>>> SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
>>>>>>>> SLURM_NODE_ALIASES='(null)'
>>>>>>>> SLURM_NPROCS=64
>>>>>>>> SLURM_NTASKS=64
>>>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>>>>>>>> SLURM_SUBMIT_HOST=login-0-2.local
>>>>>>>> SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
>>>>>>>> 
>>>>>>>> There are also a lot of warnings like:
>>>>>>>> 
>>>>>>>> [compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all 
>>>>>>>> available processors)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -----------
>>>>>>>> TEST RUN 2
>>>>>>>> -----------
>>>>>>>> 
>>>>>>>> In another allocation I got a different error
>>>>>>>> 
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>> processes than cpus on a resource:
>>>>>>>> 
>>>>>>>>   Bind to:     CORE
>>>>>>>>   Node:        c6-19
>>>>>>>>   #processes:  2
>>>>>>>>   #cpus:       1
>>>>>>>> 
>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>> option to your binding directive.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> 
>>>>>>>> and the allocation was the following
>>>>>>>> 
>>>>>>>> SLURM_JOBID=12712250
>>>>>>>> SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
>>>>>>>> SLURM_JOB_ID=12712250
>>>>>>>> SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
>>>>>>>> SLURM_JOB_NUM_NODES=15
>>>>>>>> SLURM_JOB_PARTITION=normal
>>>>>>>> SLURM_MEM_PER_CPU=2048
>>>>>>>> SLURM_NNODES=15
>>>>>>>> SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
>>>>>>>> SLURM_NODE_ALIASES='(null)'
>>>>>>>> SLURM_NPROCS=64
>>>>>>>> SLURM_NTASKS=64
>>>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>>>>>>>> SLURM_SUBMIT_HOST=login-0-2.local
>>>>>>>> SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
>>>>>>>> 
>>>>>>>> 
>>>>>>>> If in this case I run on only 32 cores
>>>>>>>> 
>>>>>>>> mpirun  --report-bindings --bind-to core -np 32 ./affinity
>>>>>>>> 
>>>>>>>> the process starts, but I get the original binding problem:
>>>>>>>> 
>>>>>>>> [compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all 
>>>>>>>> available processors)
>>>>>>>> 
>>>>>>>> Running with --hetero-nodes yields exactly the same results
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hope the above is useful. The binding problem under SLURM, with CPU 
>>>>>>>> cores spread over nodes, seems to be very reproducible: OpenMPI very 
>>>>>>>> often dies with an error like the above. These tests were run with 
>>>>>>>> openmpi-1.8.8 and 1.10.0, both giving the same results.
>>>>>>>> 
>>>>>>>> One more suggestion. The warning message (MCW rank 8 is not bound...) 
>>>>>>>> is ONLY displayed when I use --report-bindings. It is never shown if I 
>>>>>>>> leave out this option, and although the binding is wrong, the user is 
>>>>>>>> not notified. I think it would be better to show this warning in all 
>>>>>>>> cases where binding fails.
>>>>>>>> 
>>>>>>>> Let me know if you need more information. I can help to debug this - 
>>>>>>>> it is a rather crucial issue.
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> Marcin
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 10/02/2015 11:49 PM, Ralph Castain wrote:
>>>>>>>>> Can you please send me the allocation request you made (so I can see 
>>>>>>>>> what you specified on the cmd line), and the mpirun cmd line?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Ralph
>>>>>>>>> 
>>>>>>>>>> On Oct 2, 2015, at 8:25 AM, Marcin Krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I fail to make OpenMPI bind to cores correctly when running from 
>>>>>>>>>> within SLURM-allocated CPU resources spread over a range of compute 
>>>>>>>>>> nodes in an otherwise homogeneous cluster. I have found this thread
>>>>>>>>>> 
>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2014/06/24682.php
>>>>>>>>>> 
>>>>>>>>>> and did try to use what Ralph suggested there (--hetero-nodes), but 
>>>>>>>>>> it does not work (v. 1.10.0). When running with --report-bindings I 
>>>>>>>>>> get messages like
>>>>>>>>>> 
>>>>>>>>>> [compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all 
>>>>>>>>>> available processors)
>>>>>>>>>> 
>>>>>>>>>> for all ranks outside of my first physical compute node. Moreover, 
>>>>>>>>>> everything works as expected if I ask SLURM to assign entire compute 
>>>>>>>>>> nodes. So it does look like Ralph's diagnosis in that thread is 
>>>>>>>>>> correct; it is just that the --hetero-nodes switch does not work for 
>>>>>>>>>> me.
>>>>>>>>>> 
>>>>>>>>>> I have written a short program that uses sched_getaffinity to print 
>>>>>>>>>> the effective bindings: all MPI ranks except those on the first node 
>>>>>>>>>> are bound to all CPU cores allocated by SLURM.
>>>>>>>>>> 
>>>>>>>>>> Do I have to do something besides --hetero-nodes, or is this a 
>>>>>>>>>> problem that needs further investigation?
>>>>>>>>>> 
>>>>>>>>>> Thanks a lot!
>>>>>>>>>> 
>>>>>>>>>> Marcin
>>>>>>>>>> 
>>>>>> 
>>>>>> <affinity.c>