Rats - just realized I have no way to test this, as none of the machines I can access are set up for cgroup-based multi-tenant operation. Is this a debug build of OMPI? If not, can you rebuild OMPI with --enable-debug?
Then please run it with --mca rmaps_base_verbose 10 and pass along the output. Thanks Ralph > On Oct 3, 2015, at 10:09 AM, Ralph Castain <r...@open-mpi.org> wrote: > > What version of slurm is this? I might try to debug it here. I’m not sure > where the problem lies just yet. > > >> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com >> <mailto:marcin.krotkiew...@gmail.com>> wrote: >> >> Here is the output of lstopo. In short, (0,16) are core 0, (1,17) - core 1 >> etc. >> >> Machine (64GB) >> NUMANode L#0 (P#0 32GB) >> Socket L#0 + L3 L#0 (20MB) >> L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 >> PU L#0 (P#0) >> PU L#1 (P#16) >> L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 >> PU L#2 (P#1) >> PU L#3 (P#17) >> L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 >> PU L#4 (P#2) >> PU L#5 (P#18) >> L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 >> PU L#6 (P#3) >> PU L#7 (P#19) >> L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 >> PU L#8 (P#4) >> PU L#9 (P#20) >> L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 >> PU L#10 (P#5) >> PU L#11 (P#21) >> L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 >> PU L#12 (P#6) >> PU L#13 (P#22) >> L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 >> PU L#14 (P#7) >> PU L#15 (P#23) >> HostBridge L#0 >> PCIBridge >> PCI 8086:1521 >> Net L#0 "eth0" >> PCI 8086:1521 >> Net L#1 "eth1" >> PCIBridge >> PCI 15b3:1003 >> Net L#2 "ib0" >> OpenFabrics L#3 "mlx4_0" >> PCIBridge >> PCI 102b:0532 >> PCI 8086:1d02 >> Block L#4 "sda" >> NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB) >> L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 >> PU L#16 (P#8) >> PU L#17 (P#24) >> L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 >> PU L#18 (P#9) >> PU L#19 (P#25) >> L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 >> PU L#20 (P#10) >> PU L#21 (P#26) >> L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 >> PU L#22 (P#11) >> PU L#23 (P#27) >> L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 >> PU L#24 (P#12) >> PU L#25 (P#28) >> L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 >> PU L#26 (P#13) >> PU L#27 (P#29) >> L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 >> PU L#28 (P#14) >> PU L#29 (P#30) >> L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 >> PU L#30 (P#15) >> PU L#31 (P#31) >> >> >> >> On 10/03/2015 05:46 PM, Ralph Castain wrote: >>> Maybe I’m just misreading your HT map - that slurm nodelist syntax is a new >>> one to me, but they tend to change things around. Could you run lstopo on >>> one of those compute nodes and send the output? >>> >>> I’m just suspicious because I’m not seeing a clear pairing of HT numbers in >>> your output, but HT numbering is BIOS-specific and I may just not be >>> understanding your particular pattern. Our error message is clearly >>> indicating that we are seeing individual HTs (and not complete cores) >>> assigned, and I don’t know the source of that confusion. >>> >>> >>>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski >>>> <marcin.krotkiew...@gmail.com <mailto:marcin.krotkiew...@gmail.com>> wrote: >>>> >>>> >>>> On 10/03/2015 04:38 PM, Ralph Castain wrote: >>>>> If mpirun isn’t trying to do any binding, then you will of course get the >>>>> right mapping as we’ll just inherit whatever we received. >>>> Yes. 
I meant that whatever you received (what SLURM gives) is a correct >>>> cpu map and assigns _whole_ CPUs, not a single HT to MPI processes. In the >>>> case mentioned earlier openmpi should start 6 tasks on c1-30. If HT would >>>> be treated as separate and independent cores, sched_getaffinity of an MPI >>>> process started on c1-30 would return a map with 6 entries only. In my >>>> case it returns a map with 12 entries - 2 for each core. So one process >>>> is in fact allocated both HTs, not only one. Is what I'm saying correct? >>>> >>>>> Looking at your output, it’s pretty clear that you are getting >>>>> independent HTs assigned and not full cores. >>>> How do you mean? Is the above understanding wrong? I would expect that on >>>> c1-30 with --bind-to core openmpi should bind to logical cores 0 and 16 >>>> (rank 0), 1 and 17 (rank 2) and so on. All those logical cores are >>>> available in sched_getaffinity map, and there is twice as many logical >>>> cores as there are MPI processes started on the node. >>>> >>>>> My guess is that something in slurm has changed such that it detects that >>>>> HT has been enabled, and then begins treating the HTs as completely >>>>> independent cpus. >>>>> >>>>> Try changing “-bind-to core” to “-bind-to hwthread -use-hwthread-cpus” >>>>> and see if that works >>>>> >>>> I have and the binding is wrong. For example, I got this output >>>> >>>> rank 0 @ compute-1-30.local 0, >>>> rank 1 @ compute-1-30.local 16, >>>> >>>> Which means that two ranks have been bound to the same physical core >>>> (logical cores 0 and 16 are two HTs of the same core). If I use --bind-to >>>> core, I get the following correct binding >>>> >>>> rank 0 @ compute-1-30.local 0, 16, >>>> >>>> The problem is many other ranks get bad binding with 'rank XXX is not >>>> bound (or bound to all available processors)' warning. >>>> >>>> But I think I was not entirely correct saying that 1.10.1rc1 did not fix >>>> things. It still might have improved something, but not everything. >>>> Consider this job: >>>> >>>> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6' >>>> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]' >>>> >>>> If I run 32 tasks as follows (with 1.10.1rc1) >>>> >>>> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity >>>> >>>> I get the following error: >>>> >>>> -------------------------------------------------------------------------- >>>> A request was made to bind to that would result in binding more >>>> processes than cpus on a resource: >>>> >>>> Bind to: CORE >>>> Node: c9-31 >>>> #processes: 2 >>>> #cpus: 1 >>>> >>>> You can override this protection by adding the "overload-allowed" >>>> option to your binding directive. >>>> -------------------------------------------------------------------------- >>>> >>>> >>>> If I now use --bind-to core:overload-allowed, then openmpi starts and >>>> _most_ of the threads are bound correctly (i.e., map contains two logical >>>> cores in ALL cases), except this case that required the overload flag: >>>> >>>> rank 15 @ compute-9-31.local 1, 17, >>>> rank 16 @ compute-9-31.local 11, 27, >>>> rank 17 @ compute-9-31.local 2, 18, >>>> rank 18 @ compute-9-31.local 12, 28, >>>> rank 19 @ compute-9-31.local 1, 17, >>>> >>>> Note pair 1,17 is used twice. 
The original SLURM delivered map (no >>>> binding) on this node is >>>> >>>> rank 15 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29, >>>> rank 16 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29, >>>> rank 17 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29, >>>> rank 18 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29, >>>> rank 19 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29, >>>> >>>> Why does openmpi use cores (1,17) twice instead of using core (13,29)? >>>> Clearly, the original SLURM-delivered map has 5 CPUs included, enough for >>>> 5 MPI processes. >>>> >>>> Cheers, >>>> >>>> Marcin >>>> >>>> >>>>> >>>>>> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski >>>>>> <marcin.krotkiew...@gmail.com <mailto:marcin.krotkiew...@gmail.com>> >>>>>> wrote: >>>>>> >>>>>> >>>>>> On 10/03/2015 01:06 PM, Ralph Castain wrote: >>>>>>> Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating >>>>>>> HTs as “cores” - i.e., as independent cpus. Any chance that is true? >>>>>> Not to the best of my knowledge, and at least not intentionally. SLURM >>>>>> starts as many processes as there are physical cores, not threads. To >>>>>> verify this, consider this test case: >>>>>> >>>>>> SLURM_JOB_CPUS_PER_NODE='6,8(x2),10' >>>>>> SLURM_JOB_NODELIST='c1-[30-31],c2-[32,34]' >>>>>> >>>>>> If I now execute only one mpi process WITH NO BINDING, it will go onto >>>>>> c1-30 and should have a map with 6 CPUs (12 hw threads). I run >>>>>> >>>>>> mpirun --bind-to none -np 1 ./affinity >>>>>> rank 0 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22, >>>>>> >>>>>> I have attached the affinity.c program FYI. Clearly, sched_getaffinity >>>>>> in my test code returns the correct map. >>>>>> >>>>>> Now if I try to start all 32 processes in this example (still no >>>>>> binding): >>>>>> >>>>>> rank 0 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22, >>>>>> rank 1 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22, >>>>>> rank 10 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, >>>>>> 27, 28, 29, 30, 31, >>>>>> rank 11 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, >>>>>> 27, 28, 29, 30, 31, >>>>>> rank 12 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, >>>>>> 27, 28, 29, 30, 31, >>>>>> rank 13 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, >>>>>> 27, 28, 29, 30, 31, >>>>>> rank 6 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, >>>>>> 27, 28, 29, 30, 31, >>>>>> rank 2 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22, >>>>>> rank 7 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, >>>>>> 27, 28, 29, 30, 31, >>>>>> rank 8 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, >>>>>> 27, 28, 29, 30, 31, >>>>>> rank 3 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22, >>>>>> rank 14 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, >>>>>> 26, 27, 28, 29, 30, >>>>>> rank 4 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22, >>>>>> rank 15 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, >>>>>> 26, 27, 28, 29, 30, >>>>>> rank 9 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, >>>>>> 27, 28, 29, 30, 31, >>>>>> rank 5 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22, >>>>>> rank 16 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, >>>>>> 26, 27, 28, 29, 30, >>>>>> rank 17 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, >>>>>> 26, 27, 28, 
29, 30, >>>>>> rank 29 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, >>>>>> 18, 19, 20, 21, 22, 23, 30, 31, >>>>>> rank 30 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, >>>>>> 18, 19, 20, 21, 22, 23, 30, 31, >>>>>> rank 18 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, >>>>>> 26, 27, 28, 29, 30, >>>>>> rank 19 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, >>>>>> 26, 27, 28, 29, 30, >>>>>> rank 31 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, >>>>>> 18, 19, 20, 21, 22, 23, 30, 31, >>>>>> rank 20 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, >>>>>> 26, 27, 28, 29, 30, >>>>>> rank 22 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, >>>>>> 18, 19, 20, 21, 22, 23, 30, 31, >>>>>> rank 21 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, >>>>>> 26, 27, 28, 29, 30, >>>>>> rank 23 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, >>>>>> 18, 19, 20, 21, 22, 23, 30, 31, >>>>>> rank 24 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, >>>>>> 18, 19, 20, 21, 22, 23, 30, 31, >>>>>> rank 25 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, >>>>>> 18, 19, 20, 21, 22, 23, 30, 31, >>>>>> rank 26 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, >>>>>> 18, 19, 20, 21, 22, 23, 30, 31, >>>>>> rank 27 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, >>>>>> 18, 19, 20, 21, 22, 23, 30, 31, >>>>>> rank 28 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, >>>>>> 18, 19, 20, 21, 22, 23, 30, 31, >>>>>> >>>>>> >>>>>> Still looks ok to me. If I now turn the binding on, openmpi fails: >>>>>> >>>>>> >>>>>> -------------------------------------------------------------------------- >>>>>> A request was made to bind to that would result in binding more >>>>>> processes than cpus on a resource: >>>>>> >>>>>> Bind to: CORE >>>>>> Node: c1-31 >>>>>> #processes: 2 >>>>>> #cpus: 1 >>>>>> >>>>>> You can override this protection by adding the "overload-allowed" >>>>>> option to your binding directive. >>>>>> -------------------------------------------------------------------------- >>>>>> >>>>>> The above tests were done with 1.10.1rc1, so it does not fix the problem. >>>>>> >>>>>> Marcin >>>>>> >>>>>> >>>>>>> I’m wondering because bind-to core will attempt to bind your proc to >>>>>>> both HTs on the core. For some reason, we thought that 8.24 were HTs on >>>>>>> the same core, which is why we tried to bind to that pair of HTs. We >>>>>>> got an error because HT #24 was not allocated to us on node c6, but HT >>>>>>> #8 was. >>>>>>> >>>>>>> >>>>>>>> On Oct 3, 2015, at 2:43 AM, marcin.krotkiewski < >>>>>>>> <mailto:marcin.krotkiew...@gmail.com>marcin.krotkiew...@gmail.com >>>>>>>> <mailto:marcin.krotkiew...@gmail.com>> wrote: >>>>>>>> >>>>>>>> Hi, Ralph, >>>>>>>> >>>>>>>> I submit my slurm job as follows >>>>>>>> >>>>>>>> salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0 >>>>>>>> >>>>>>>> Effectively, the allocated CPU cores are spread amount many cluster >>>>>>>> nodes. SLURM uses cgroups to limit the CPU cores available for mpi >>>>>>>> processes running on a given cluster node. 
Compute nodes are 2-socket, >>>>>>>> 8-core E5-2670 systems with HyperThreading on >>>>>>>> >>>>>>>> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 >>>>>>>> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31 >>>>>>>> node distances: >>>>>>>> node 0 1 >>>>>>>> 0: 10 21 >>>>>>>> 1: 21 10 >>>>>>>> >>>>>>>> I run MPI program with command >>>>>>>> >>>>>>>> mpirun --report-bindings --bind-to core -np 64 ./affinity >>>>>>>> >>>>>>>> The program simply runs sched_getaffinity for each process and prints >>>>>>>> out the result. >>>>>>>> >>>>>>>> ----------- >>>>>>>> TEST RUN 1 >>>>>>>> ----------- >>>>>>>> For this particular job the problem is more severe: openmpi fails to >>>>>>>> run at all with error >>>>>>>> >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> Open MPI tried to bind a new process, but something went wrong. The >>>>>>>> process was killed without launching the target application. Your job >>>>>>>> will now abort. >>>>>>>> >>>>>>>> Local host: c6-6 >>>>>>>> Application name: ./affinity >>>>>>>> Error message: hwloc_set_cpubind returned "Error" for bitmap >>>>>>>> "8,24" >>>>>>>> Location: odls_default_module.c:551 >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> >>>>>>>> This is SLURM environment variables: >>>>>>>> >>>>>>>> SLURM_JOBID=12712225 >>>>>>>> SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1' >>>>>>>> SLURM_JOB_ID=12712225 >>>>>>>> SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11' >>>>>>>> SLURM_JOB_NUM_NODES=24 >>>>>>>> SLURM_JOB_PARTITION=normal >>>>>>>> SLURM_MEM_PER_CPU=2048 >>>>>>>> SLURM_NNODES=24 >>>>>>>> SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11' >>>>>>>> SLURM_NODE_ALIASES='(null)' >>>>>>>> SLURM_NPROCS=64 >>>>>>>> SLURM_NTASKS=64 >>>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink >>>>>>>> SLURM_SUBMIT_HOST=login-0-2.local >>>>>>>> SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1' >>>>>>>> >>>>>>>> There is also a lot of warnings like >>>>>>>> >>>>>>>> [compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all >>>>>>>> available processors) >>>>>>>> >>>>>>>> >>>>>>>> ----------- >>>>>>>> TEST RUN 2 >>>>>>>> ----------- >>>>>>>> >>>>>>>> In another allocation I got a different error >>>>>>>> >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> A request was made to bind to that would result in binding more >>>>>>>> processes than cpus on a resource: >>>>>>>> >>>>>>>> Bind to: CORE >>>>>>>> Node: c6-19 >>>>>>>> #processes: 2 >>>>>>>> #cpus: 1 >>>>>>>> >>>>>>>> You can override this protection by adding the "overload-allowed" >>>>>>>> option to your binding directive. 
>>>>>>>> -------------------------------------------------------------------------- >>>>>>>> >>>>>>>> and the allocation was the following >>>>>>>> >>>>>>>> SLURM_JOBID=12712250 >>>>>>>> SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4' >>>>>>>> SLURM_JOB_ID=12712250 >>>>>>>> SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]' >>>>>>>> SLURM_JOB_NUM_NODES=15 >>>>>>>> SLURM_JOB_PARTITION=normal >>>>>>>> SLURM_MEM_PER_CPU=2048 >>>>>>>> SLURM_NNODES=15 >>>>>>>> SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]' >>>>>>>> SLURM_NODE_ALIASES='(null)' >>>>>>>> SLURM_NPROCS=64 >>>>>>>> SLURM_NTASKS=64 >>>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink >>>>>>>> SLURM_SUBMIT_HOST=login-0-2.local >>>>>>>> SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4' >>>>>>>> >>>>>>>> >>>>>>>> If in this case I run on only 32 cores >>>>>>>> >>>>>>>> mpirun --report-bindings --bind-to core -np 32 ./affinity >>>>>>>> >>>>>>>> the process starts, but I get the original binding problem: >>>>>>>> >>>>>>>> [compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all >>>>>>>> available processors) >>>>>>>> >>>>>>>> Running with --hetero-nodes yields exactly the same results >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Hope the above is useful. The problem with binding under SLURM with >>>>>>>> CPU cores spread over nodes seems to be very reproducible. It is >>>>>>>> actually very often that OpenMPI dies with some error like above. >>>>>>>> These tests were run with openmpi-1.8.8 and 1.10.0, both giving same >>>>>>>> results. >>>>>>>> >>>>>>>> One more suggestion. The warning message (MCW rank 8 is not bound...) >>>>>>>> is ONLY displayed when I use --report-bindings. It is never shown if I >>>>>>>> leave out this option, and although the binding is wrong the user is >>>>>>>> not notified. I think it would be better to show this warning in all >>>>>>>> cases binding fails. >>>>>>>> >>>>>>>> Let me know if you need more information. I can help to debug this - >>>>>>>> it is a rather crucial issue. >>>>>>>> >>>>>>>> Thanks! >>>>>>>> >>>>>>>> Marcin >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 10/02/2015 11:49 PM, Ralph Castain wrote: >>>>>>>>> Can you please send me the allocation request you made (so I can see >>>>>>>>> what you specified on the cmd line), and the mpirun cmd line? >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Ralph >>>>>>>>> >>>>>>>>>> On Oct 2, 2015, at 8:25 AM, Marcin Krotkiewski >>>>>>>>>> <marcin.krotkiew...@gmail.com <mailto:marcin.krotkiew...@gmail.com>> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I fail to make OpenMPI bind to cores correctly when running from >>>>>>>>>> within SLURM-allocated CPU resources spread over a range of compute >>>>>>>>>> nodes in an otherwise homogeneous cluster. I have found this thread >>>>>>>>>> >>>>>>>>>> http://www.open-mpi.org/community/lists/users/2014/06/24682.php >>>>>>>>>> <http://www.open-mpi.org/community/lists/users/2014/06/24682.php> >>>>>>>>>> >>>>>>>>>> and did try to use what Ralph suggested there (--hetero-nodes), but >>>>>>>>>> it does not work (v. 1.10.0). When running with --report-bindings I >>>>>>>>>> get messages like >>>>>>>>>> >>>>>>>>>> [compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all >>>>>>>>>> available processors) >>>>>>>>>> >>>>>>>>>> for all ranks outside of my first physical compute node. Moreover, >>>>>>>>>> everything works as expected if I ask SLURM to assign entire compute >>>>>>>>>> nodes. 
So it does look like Ralph's diagnose presented in that >>>>>>>>>> thread is correct, just the --hetero-nodes switch does not work for >>>>>>>>>> me. >>>>>>>>>> >>>>>>>>>> I have written a short code that uses sched_getaffinity to print the >>>>>>>>>> effective bindings: all MPI ranks except of those on the first node >>>>>>>>>> are bound to all CPU cores allocated by SLURM. >>>>>>>>>> >>>>>>>>>> Do I have to do something except of --hetero-nodes, or is this a >>>>>>>>>> problem that needs further investigation? >>>>>>>>>> >>>>>>>>>> Thanks a lot! >>>>>>>>>> >>>>>>>>>> Marcin
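
For readers who do not have the attachment, the affinity test program used throughout this thread simply calls sched_getaffinity() in every MPI rank and prints the hostname together with the logical CPUs in the returned mask. The following is only a minimal sketch along those lines, not the attached affinity.c; the variable names and the output formatting (rank number, hostname, comma-separated CPU list) are assumptions chosen to match the output quoted above.

/* Sketch of an affinity printer (reconstruction, not the attached original):
 * each MPI rank reports its hostname and the logical CPUs it may run on. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, cpu;
    char host[256], list[8192] = "";
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Collect the logical CPU numbers present in the affinity mask. */
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &mask)) {
            char item[16];
            snprintf(item, sizeof(item), "%d, ", cpu);
            strcat(list, item);
        }
    }

    printf("rank %d @ %s %s\n", rank, host, list);

    MPI_Finalize();
    return 0;
}

Built with something like mpicc affinity.c -o affinity and launched through mpirun as in the examples above, a rank correctly bound to a full core on these nodes would be expected to print both hardware threads of that core, e.g. rank 0 @ compute-1-30.local 0, 16,.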
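The core-to-hardware-thread pairing that the lstopo output documents (core 0 holding PUs 0 and 16, core 1 holding PUs 1 and 17, and so on) can also be cross-checked programmatically with the hwloc C API, the same library mpirun relies on for binding. The sketch below is illustrative only; the file name, the build line, and the availability of hwloc headers on the compute nodes are assumptions.

/* Illustrative sketch: for every core hwloc sees, print the OS indices (P#)
 * of the hardware threads (PUs) it contains. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    int i, ncores;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        char pus[128];
        /* core->cpuset holds the OS indices of the PUs belonging to this core. */
        hwloc_bitmap_list_snprintf(pus, sizeof(pus), core->cpuset);
        printf("Core L#%u: PUs %s\n", core->logical_index, pus);
    }

    hwloc_topology_destroy(topo);
    return 0;
}

Compiled with something like gcc core_pus.c -o core_pus $(pkg-config --cflags --libs hwloc), on the nodes described above it should report 16 cores, each containing two PU OS indices 16 apart - the pairs that --bind-to core is expected to bind to.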