Ralph,

I suspect ompi tries to bind to threads outside the cpuset. This could be
pretty similar to a previous issue, when ompi tried to bind to cores outside
the cpuset.

/* when a core has more than one thread, does ompi assume all the threads are
available if the core is available? */

I will investigate this starting tomorrow.

Cheers,

Gilles
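[As a rough illustration of the question in the comment above, the sketch
below (hypothetical, not Open MPI code) walks the hwloc topology and reports,
for every core that intersects this process's allowed cpuset, whether all of
its hwthreads are allowed or only some of them. The WHOLE_SYSTEM flag name is
the hwloc 1.x spelling; build with something like "cc check_cores.c -lhwloc",
where the file name is made up.]

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    /* hwloc 1.x flag: keep PUs disallowed by the cgroup/cpuset visible,
       so partially-allowed cores can be detected at all */
    hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM);
    hwloc_topology_load(topo);

    hwloc_const_cpuset_t allowed = hwloc_topology_get_allowed_cpuset(topo);

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (int i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        if (!hwloc_bitmap_intersects(core->cpuset, allowed))
            continue;  /* core entirely outside this job's cpuset */
        char pus[128];
        hwloc_bitmap_list_snprintf(pus, sizeof(pus), core->cpuset);
        printf("Core L#%u (PUs %s): %s\n", core->logical_index, pus,
               hwloc_bitmap_isincluded(core->cpuset, allowed)
                   ? "all hwthreads allowed"
                   : "only some hwthreads allowed");
    }

    hwloc_topology_destroy(topo);
    return 0;
}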
On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org> wrote:

> Thanks - please go ahead and release that allocation as I'm not going to
> get to this immediately. I've got several hot irons in the fire right now,
> and I'm not sure when I'll get a chance to track this down.
>
> Gilles or anyone else who might have time - feel free to take a gander and
> see if something pops out at you.
>
> Ralph
>
> On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>
> Done. I have compiled 1.10.0 and 1.10.rc1 with --enable-debug and executed
>
> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
>
> In the case of 1.10.rc1 I have also added :overload-allowed (output in a
> separate file). This option did not make much difference for 1.10.0, so I
> did not attach it here.
>
> First thing I noted for 1.10.0 are lines like
>
> [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 IS NOT BOUND
>
> with an empty BITMAP.
>
> The SLURM environment is
>
> set | grep SLURM
> SLURM_JOBID=12714491
> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
> SLURM_JOB_ID=12714491
> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
> SLURM_JOB_NUM_NODES=7
> SLURM_JOB_PARTITION=normal
> SLURM_MEM_PER_CPU=2048
> SLURM_NNODES=7
> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
> SLURM_NODE_ALIASES='(null)'
> SLURM_NPROCS=32
> SLURM_NTASKS=32
> SLURM_SUBMIT_DIR=/cluster/home/marcink
> SLURM_SUBMIT_HOST=login-0-1.local
> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
>
> I have submitted an interactive job on screen for 120 hours now to work
> with one example, and not change it for every post :)
>
> If you need anything else, let me know. I could introduce some
> patch/printfs and recompile, if you need it.
>
> Marcin
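[The ./affinity program run by the mpirun command above is not included in
the thread; as a stand-in, here is a minimal sketch (hypothetical, not
Marcin's actual source) in which each rank prints its sched_getaffinity mask
in the same "rank N @ host ..." style seen in the outputs below. Build with
mpicc on Linux.]

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[256];
    gethostname(host, sizeof(host));

    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        /* build one line per rank: "rank N @ host cpu, cpu, ..." */
        char line[4096];
        int off = snprintf(line, sizeof(line), "rank %d @ %s ", rank, host);
        for (int cpu = 0; cpu < CPU_SETSIZE && off < (int)sizeof(line) - 16; cpu++)
            if (CPU_ISSET(cpu, &mask))
                off += snprintf(line + off, sizeof(line) - off, "%d, ", cpu);
        puts(line);
    }

    MPI_Finalize();
    return 0;
}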
> On 10/03/2015 07:17 PM, Ralph Castain wrote:
>
> Rats - just realized I have no way to test this as none of the machines I
> can access are set up for cgroup-based multi-tenant. Is this a debug version
> of OMPI? If not, can you rebuild OMPI with --enable-debug?
>
> Then please run it with --mca rmaps_base_verbose 10 and pass along the
> output.
>
> Thanks
> Ralph
>
> On Oct 3, 2015, at 10:09 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> What version of slurm is this? I might try to debug it here. I'm not sure
> where the problem lies just yet.
>
> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>
> Here is the output of lstopo. In short, (0,16) are core 0, (1,17) are core 1,
> etc.
>
> Machine (64GB)
>   NUMANode L#0 (P#0 32GB)
>     Socket L#0 + L3 L#0 (20MB)
>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>         PU L#0 (P#0)
>         PU L#1 (P#16)
>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>         PU L#2 (P#1)
>         PU L#3 (P#17)
>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>         PU L#4 (P#2)
>         PU L#5 (P#18)
>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>         PU L#6 (P#3)
>         PU L#7 (P#19)
>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>         PU L#8 (P#4)
>         PU L#9 (P#20)
>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>         PU L#10 (P#5)
>         PU L#11 (P#21)
>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>         PU L#12 (P#6)
>         PU L#13 (P#22)
>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>         PU L#14 (P#7)
>         PU L#15 (P#23)
>     HostBridge L#0
>       PCIBridge
>         PCI 8086:1521
>           Net L#0 "eth0"
>         PCI 8086:1521
>           Net L#1 "eth1"
>       PCIBridge
>         PCI 15b3:1003
>           Net L#2 "ib0"
>           OpenFabrics L#3 "mlx4_0"
>       PCIBridge
>         PCI 102b:0532
>       PCI 8086:1d02
>         Block L#4 "sda"
>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>       PU L#16 (P#8)
>       PU L#17 (P#24)
>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>       PU L#18 (P#9)
>       PU L#19 (P#25)
>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>       PU L#20 (P#10)
>       PU L#21 (P#26)
>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>       PU L#22 (P#11)
>       PU L#23 (P#27)
>     L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>       PU L#24 (P#12)
>       PU L#25 (P#28)
>     L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>       PU L#26 (P#13)
>       PU L#27 (P#29)
>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>       PU L#28 (P#14)
>       PU L#29 (P#30)
>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>       PU L#30 (P#15)
>       PU L#31 (P#31)
>
> On 10/03/2015 05:46 PM, Ralph Castain wrote:
>
> Maybe I'm just misreading your HT map - that slurm nodelist syntax is a
> new one to me, but they tend to change things around. Could you run lstopo
> on one of those compute nodes and send the output?
>
> I'm just suspicious because I'm not seeing a clear pairing of HT numbers
> in your output, but HT numbering is BIOS-specific and I may just not be
> understanding your particular pattern. Our error message is clearly
> indicating that we are seeing individual HTs (and not complete cores)
> assigned, and I don't know the source of that confusion.
>
> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>
> On 10/03/2015 04:38 PM, Ralph Castain wrote:
>
> If mpirun isn't trying to do any binding, then you will of course get the
> right mapping as we'll just inherit whatever we received.
>
> Yes. I meant that whatever you received (what SLURM gives) is a correct
> cpu map and assigns _whole_ CPUs, not a single HT, to MPI processes. In the
> case mentioned earlier openmpi should start 6 tasks on c1-30. If HT were
> treated as separate and independent cores, sched_getaffinity of an MPI
> process started on c1-30 would return a map with only 6 entries. In my case
> it returns a map with 12 entries - 2 for each core. So one process is in
> fact allocated both HTs, not only one. Is what I'm saying correct?
> Looking at your output, it's pretty clear that you are getting independent
> HTs assigned and not full cores.
>
> How do you mean? Is the above understanding wrong? I would expect that on
> c1-30 with --bind-to core openmpi should bind to logical cores 0 and 16
> (rank 0), 1 and 17 (rank 1), and so on. All those logical cores are
> available in the sched_getaffinity map, and there are twice as many logical
> cores as there are MPI processes started on the node.
>
> My guess is that something in slurm has changed such that it detects that
> HT has been enabled, and then begins treating the HTs as completely
> independent cpus.
>
> Try changing "-bind-to core" to "-bind-to hwthread -use-hwthread-cpus"
> and see if that works
>
> I have, and the binding is wrong. For example, I got this output
>
> rank 0 @ compute-1-30.local 0,
> rank 1 @ compute-1-30.local 16,
>
> which means that two ranks have been bound to the same physical core
> (logical cores 0 and 16 are two HTs of the same core). If I use --bind-to
> core, I get the following correct binding
>
> rank 0 @ compute-1-30.local 0, 16,
>
> The problem is that many other ranks get bad binding, with a 'rank XXX is not
> bound (or bound to all available processors)' warning.
>
> But I think I was not entirely correct saying that 1.10.1rc1 did not fix
> things. It still might have improved something, but not everything.
> Consider this job:
>
> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
>
> If I run 32 tasks as follows (with 1.10.1rc1)
>
> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
>
> I get the following error:
>
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>    Bind to:     CORE
>    Node:        c9-31
>    #processes:  2
>    #cpus:       1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
>
> If I now use --bind-to core:overload-allowed, then openmpi starts and
> _most_ of the threads are bound correctly (i.e., the map contains two logical
> cores in ALL cases), except for this case, which required the overload flag:
>
> rank 15 @ compute-9-31.local 1, 17,
> rank 16 @ compute-9-31.local 11, 27,
> rank 17 @ compute-9-31.local 2, 18,
> rank 18 @ compute-9-31.local 12, 28,
> rank 19 @ compute-9-31.local 1, 17,
>
> Note the pair 1,17 is used twice. The original SLURM-delivered map (no
> binding) on this node is
>
> rank 15 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
> rank 16 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
> rank 17 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
> rank 18 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
> rank 19 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>
> Why does openmpi use core (1,17) twice instead of using core (13,29)?
> Clearly, the original SLURM-delivered map has 5 CPUs included, enough for 5
> MPI processes.
>
> Cheers,
>
> Marcin
>
> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>
> On 10/03/2015 01:06 PM, Ralph Castain wrote:
>
> Thanks Marcin. Looking at this, I'm guessing that Slurm may be treating
> HTs as "cores" - i.e., as independent cpus. Any chance that is true?
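[A note on the compressed Slurm syntax that appears above and that Ralph
mentions is new to him: the (xN) suffix in SLURM_JOB_CPUS_PER_NODE and
SLURM_TASKS_PER_NODE is simply a repeat count, so '5(x2)' means two
consecutive nodes with 5 CPUs each. A small sketch (plain C, no Slurm API;
the fallback string is the allocation quoted above) that expands it:]

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *s = getenv("SLURM_JOB_CPUS_PER_NODE");
    if (!s) s = "5,4,6,5(x2),7,5,9,5,7,6";      /* allocation quoted above */

    long total = 0;
    while (*s) {
        char *end;
        long count = strtol(s, &end, 10);       /* CPUs on this node */
        long repeat = 1;
        if (*end == '(' && end[1] == 'x') {     /* "(xN)" = repeat N times */
            repeat = strtol(end + 2, &end, 10);
            if (*end == ')') end++;
        }
        for (long r = 0; r < repeat; r++) {
            printf("%ld ", count);
            total += count;
        }
        s = (*end == ',') ? end + 1 : end;
    }
    printf("\ntotal CPUs in allocation: %ld\n", total);
    return 0;
}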
>
> Not to the best of my knowledge, and at least not intentionally. SLURM
> starts as many processes as there are physical cores, not threads. To
> verify this, consider this test case:
>
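[The test case itself falls outside the quoted excerpt above. Purely as an
illustration of the kind of check being described (does the allowed CPU set
cover whole physical cores?), a hedged sketch that counts allowed logical
CPUs versus distinct physical cores via Linux sysfs; on these 2-way HT nodes
the logical count should be exactly twice the core count if whole cores are
handed out.]

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* read this logical CPU's core id and package id from sysfs */
static int core_of(int cpu, int *pkg)
{
    char path[128];
    int id = -1;
    *pkg = -1;
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%d", &id); fclose(f); }
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu);
    f = fopen(path, "r");
    if (f) { fscanf(f, "%d", pkg); fclose(f); }
    return id;
}

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);

    int nlogical = 0, ncores = 0;
    int seen[1024][2];                 /* (package, core) pairs already counted */
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &mask)) continue;
        nlogical++;
        int pkg, core = core_of(cpu, &pkg);
        int dup = 0;
        for (int i = 0; i < ncores; i++)
            if (seen[i][0] == pkg && seen[i][1] == core) { dup = 1; break; }
        if (!dup && ncores < 1024) {
            seen[ncores][0] = pkg;
            seen[ncores][1] = core;
            ncores++;
        }
    }
    printf("allowed logical CPUs: %d, distinct physical cores: %d\n",
           nlogical, ncores);
    return 0;
}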