Ralph,

I suspect ompi tries to bind to threads outside the cpuset. This could be
pretty similar to a previous issue, when ompi tried to bind to cores outside
the cpuset.

/* when a core has more than one thread, does ompi assume all the threads are
available if the core is available? */

I will investigate this starting tomorrow.

Cheers,

Gilles
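[As a rough illustration of the question in the comment above, the sketch
below (hypothetical, not Open MPI code) walks the hwloc topology and reports,
for every core that intersects this process's allowed cpuset, whether all of
its hwthreads are allowed or only some of them. The WHOLE_SYSTEM flag name is
the hwloc 1.x spelling; build with something like "cc check_cores.c -lhwloc",
where the file name is made up.]

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    /* hwloc 1.x flag: keep PUs disallowed by the cgroup/cpuset visible,
       so partially-allowed cores can be detected at all */
    hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM);
    hwloc_topology_load(topo);

    hwloc_const_cpuset_t allowed = hwloc_topology_get_allowed_cpuset(topo);

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (int i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        if (!hwloc_bitmap_intersects(core->cpuset, allowed))
            continue;  /* core entirely outside this job's cpuset */
        char pus[128];
        hwloc_bitmap_list_snprintf(pus, sizeof(pus), core->cpuset);
        printf("Core L#%u (PUs %s): %s\n", core->logical_index, pus,
               hwloc_bitmap_isincluded(core->cpuset, allowed)
                   ? "all hwthreads allowed"
                   : "only some hwthreads allowed");
    }

    hwloc_topology_destroy(topo);
    return 0;
}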
On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org> wrote:

> Thanks - please go ahead and release that allocation as I'm not going to
> get to this immediately. I've got several hot irons in the fire right now,
> and I'm not sure when I'll get a chance to track this down.
>
> Gilles or anyone else who might have time - feel free to take a gander and
> see if something pops out at you.
>
> Ralph
>
> On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>
> Done. I have compiled 1.10.0 and 1.10.rc1 with --enable-debug and executed
>
> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
>
> In the case of 1.10.rc1 I have also added :overload-allowed (output in a
> separate file). This option did not make much difference for 1.10.0, so I
> did not attach it here.
>
> First thing I noted for 1.10.0 are lines like
>
> [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 IS NOT BOUND
>
> with an empty BITMAP.
>
> The SLURM environment is
>
> set | grep SLURM
> SLURM_JOBID=12714491
> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
> SLURM_JOB_ID=12714491
> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
> SLURM_JOB_NUM_NODES=7
> SLURM_JOB_PARTITION=normal
> SLURM_MEM_PER_CPU=2048
> SLURM_NNODES=7
> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
> SLURM_NODE_ALIASES='(null)'
> SLURM_NPROCS=32
> SLURM_NTASKS=32
> SLURM_SUBMIT_DIR=/cluster/home/marcink
> SLURM_SUBMIT_HOST=login-0-1.local
> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
>
> I have submitted an interactive job on screen for 120 hours now to work
> with one example, and not change it for every post :)
>
> If you need anything else, let me know. I could introduce some
> patch/printfs and recompile, if you need it.
>
> Marcin
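[The ./affinity program run by the mpirun command above is not included in
the thread; as a stand-in, here is a minimal sketch (hypothetical, not
Marcin's actual source) in which each rank prints its sched_getaffinity mask
in the same "rank N @ host ..." style seen in the outputs below. Build with
mpicc on Linux.]

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[256];
    gethostname(host, sizeof(host));

    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        /* build one line per rank: "rank N @ host cpu, cpu, ..." */
        char line[4096];
        int off = snprintf(line, sizeof(line), "rank %d @ %s ", rank, host);
        for (int cpu = 0; cpu < CPU_SETSIZE && off < (int)sizeof(line) - 16; cpu++)
            if (CPU_ISSET(cpu, &mask))
                off += snprintf(line + off, sizeof(line) - off, "%d, ", cpu);
        puts(line);
    }

    MPI_Finalize();
    return 0;
}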
> On 10/03/2015 07:17 PM, Ralph Castain wrote:
>
> Rats - just realized I have no way to test this as none of the machines I
> can access are set up for cgroup-based multi-tenant. Is this a debug version
> of OMPI? If not, can you rebuild OMPI with --enable-debug?
>
> Then please run it with --mca rmaps_base_verbose 10 and pass along the
> output.
>
> Thanks
> Ralph
>
> On Oct 3, 2015, at 10:09 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> What version of slurm is this? I might try to debug it here. I'm not sure
> where the problem lies just yet.
>
> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>
> Here is the output of lstopo. In short, (0,16) are core 0, (1,17) are core 1,
> etc.
>
> Machine (64GB)
>   NUMANode L#0 (P#0 32GB)
>     Socket L#0 + L3 L#0 (20MB)
>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>         PU L#0 (P#0)
>         PU L#1 (P#16)
>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>         PU L#2 (P#1)
>         PU L#3 (P#17)
>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>         PU L#4 (P#2)
>         PU L#5 (P#18)
>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>         PU L#6 (P#3)
>         PU L#7 (P#19)
>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>         PU L#8 (P#4)
>         PU L#9 (P#20)
>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>         PU L#10 (P#5)
>         PU L#11 (P#21)
>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>         PU L#12 (P#6)
>         PU L#13 (P#22)
>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>         PU L#14 (P#7)
>         PU L#15 (P#23)
>     HostBridge L#0
>       PCIBridge
>         PCI 8086:1521
>           Net L#0 "eth0"
>         PCI 8086:1521
>           Net L#1 "eth1"
>       PCIBridge
>         PCI 15b3:1003
>           Net L#2 "ib0"
>           OpenFabrics L#3 "mlx4_0"
>       PCIBridge
>         PCI 102b:0532
>       PCI 8086:1d02
>         Block L#4 "sda"
>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>       PU L#16 (P#8)
>       PU L#17 (P#24)
>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>       PU L#18 (P#9)
>       PU L#19 (P#25)
>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>       PU L#20 (P#10)
>       PU L#21 (P#26)
>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>       PU L#22 (P#11)
>       PU L#23 (P#27)
>     L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>       PU L#24 (P#12)
>       PU L#25 (P#28)
>     L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>       PU L#26 (P#13)
>       PU L#27 (P#29)
>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>       PU L#28 (P#14)
>       PU L#29 (P#30)
>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>       PU L#30 (P#15)
>       PU L#31 (P#31)
>
> On 10/03/2015 05:46 PM, Ralph Castain wrote:
>
> Maybe I'm just misreading your HT map - that slurm nodelist syntax is a
> new one to me, but they tend to change things around. Could you run lstopo
> on one of those compute nodes and send the output?
>
> I'm just suspicious because I'm not seeing a clear pairing of HT numbers
> in your output, but HT numbering is BIOS-specific and I may just not be
> understanding your particular pattern. Our error message is clearly
> indicating that we are seeing individual HTs (and not complete cores)
> assigned, and I don't know the source of that confusion.
>
> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>
> On 10/03/2015 04:38 PM, Ralph Castain wrote:
>
> If mpirun isn't trying to do any binding, then you will of course get the
> right mapping as we'll just inherit whatever we received.
>
> Yes. I meant that whatever you received (what SLURM gives) is a correct
> cpu map and assigns _whole_ CPUs, not a single HT, to MPI processes. In the
> case mentioned earlier openmpi should start 6 tasks on c1-30. If HT were
> treated as separate and independent cores, sched_getaffinity of an MPI
> process started on c1-30 would return a map with only 6 entries. In my case
> it returns a map with 12 entries - 2 for each core. So one process is in
> fact allocated both HTs, not only one. Is what I'm saying correct?
> Looking at your output, it's pretty clear that you are getting independent
> HTs assigned and not full cores.
>
> How do you mean? Is the above understanding wrong? I would expect that on
> c1-30 with --bind-to core openmpi should bind to logical cores 0 and 16
> (rank 0), 1 and 17 (rank 1), and so on. All those logical cores are
> available in the sched_getaffinity map, and there are twice as many logical
> cores as there are MPI processes started on the node.
>
> My guess is that something in slurm has changed such that it detects that
> HT has been enabled, and then begins treating the HTs as completely
> independent cpus.
>
> Try changing "-bind-to core" to "-bind-to hwthread -use-hwthread-cpus"
> and see if that works
>
> I have, and the binding is wrong. For example, I got this output
>
> rank 0 @ compute-1-30.local 0,
> rank 1 @ compute-1-30.local 16,
>
> which means that two ranks have been bound to the same physical core
> (logical cores 0 and 16 are two HTs of the same core). If I use --bind-to
> core, I get the following correct binding
>
> rank 0 @ compute-1-30.local 0, 16,
>
> The problem is that many other ranks get bad binding, with a 'rank XXX is not
> bound (or bound to all available processors)' warning.
>
> But I think I was not entirely correct saying that 1.10.1rc1 did not fix
> things. It still might have improved something, but not everything.
> Consider this job:
>
> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
>
> If I run 32 tasks as follows (with 1.10.1rc1)
>
> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
>
> I get the following error:
>
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>    Bind to:     CORE
>    Node:        c9-31
>    #processes:  2
>    #cpus:       1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
>
> If I now use --bind-to core:overload-allowed, then openmpi starts and
> _most_ of the threads are bound correctly (i.e., the map contains two logical
> cores in ALL cases), except for this case, which required the overload flag:
>
> rank 15 @ compute-9-31.local 1, 17,
> rank 16 @ compute-9-31.local 11, 27,
> rank 17 @ compute-9-31.local 2, 18,
> rank 18 @ compute-9-31.local 12, 28,
> rank 19 @ compute-9-31.local 1, 17,
>
> Note the pair 1,17 is used twice. The original SLURM-delivered map (no
> binding) on this node is
>
> rank 15 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
> rank 16 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
> rank 17 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
> rank 18 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
> rank 19 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>
> Why does openmpi use core (1,17) twice instead of using core (13,29)?
> Clearly, the original SLURM-delivered map has 5 CPUs included, enough for 5
> MPI processes.
>
> Cheers,
>
> Marcin
>
> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>
> On 10/03/2015 01:06 PM, Ralph Castain wrote:
>
> Thanks Marcin. Looking at this, I'm guessing that Slurm may be treating
> HTs as "cores" - i.e., as independent cpus. Any chance that is true?
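[A note on the compressed Slurm syntax that appears above and that Ralph
mentions is new to him: the (xN) suffix in SLURM_JOB_CPUS_PER_NODE and
SLURM_TASKS_PER_NODE is simply a repeat count, so '5(x2)' means two
consecutive nodes with 5 CPUs each. A small sketch (plain C, no Slurm API;
the fallback string is the allocation quoted above) that expands it:]

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *s = getenv("SLURM_JOB_CPUS_PER_NODE");
    if (!s) s = "5,4,6,5(x2),7,5,9,5,7,6";      /* allocation quoted above */

    long total = 0;
    while (*s) {
        char *end;
        long count = strtol(s, &end, 10);       /* CPUs on this node */
        long repeat = 1;
        if (*end == '(' && end[1] == 'x') {     /* "(xN)" = repeat N times */
            repeat = strtol(end + 2, &end, 10);
            if (*end == ')') end++;
        }
        for (long r = 0; r < repeat; r++) {
            printf("%ld ", count);
            total += count;
        }
        s = (*end == ',') ? end + 1 : end;
    }
    printf("\ntotal CPUs in allocation: %ld\n", total);
    return 0;
}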
>
> Not to the best of my knowledge, and at least not intentionally. SLURM
> starts as many processes as there are physical cores, not threads. To
> verify this, consider this test case:
>
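[The test case itself falls outside the quoted excerpt above. Purely as an
illustration of the kind of check being described (does the allowed CPU set
cover whole physical cores?), a hedged sketch that counts allowed logical
CPUs versus distinct physical cores via Linux sysfs; on these 2-way HT nodes
the logical count should be exactly twice the core count if whole cores are
handed out.]

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* read this logical CPU's core id and package id from sysfs */
static int core_of(int cpu, int *pkg)
{
    char path[128];
    int id = -1;
    *pkg = -1;
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%d", &id); fclose(f); }
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu);
    f = fopen(path, "r");
    if (f) { fscanf(f, "%d", pkg); fclose(f); }
    return id;
}

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);

    int nlogical = 0, ncores = 0;
    int seen[1024][2];                 /* (package, core) pairs already counted */
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &mask)) continue;
        nlogical++;
        int pkg, core = core_of(cpu, &pkg);
        int dup = 0;
        for (int i = 0; i < ncores; i++)
            if (seen[i][0] == pkg && seen[i][1] == core) { dup = 1; break; }
        if (!dup && ncores < 1024) {
            seen[ncores][0] = pkg;
            seen[ncores][1] = core;
            ncores++;
        }
    }
    printf("allowed logical CPUs: %d, distinct physical cores: %d\n",
           nlogical, ncores);
    return 0;
}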