Marcin, there is no need to pursue 1.10.0 since it is known to be broken in some scenarios.

It would really help me if you could provide the logs I requested, so I can reproduce the issue and make sure we are both talking about the same scenario.

IMHO, there is no legitimate reason to -map-by hwthread -bind-to core. We might even want to issue a warning to tell the end user he might not be doing what he expects.

I will double check the warning about one task using all the cores if there is only one core available. There should be no warning at all in this case.

I will give more thought to the alternative suggested by Ralph. IMHO, bad things will happen whatever policy we choose if SLURM assigns more than one job per socket: most real-world applications are memory bound, and sharing a socket makes performance very unpredictable anyway, regardless of the OMPI binding policy.

Cheers,

Gilles

On Monday, October 5, 2015, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:

> I have applied the patch to both 1.10.0 and 1.10.1rc1. For 1.10.0 it did not help - I am not sure how much (if) you want to pursue this.
>
> For 1.10.1rc1 I was so far unable to reproduce any binding problems with jobs of up to 128 tasks. Some cosmetic suggestions. The warning it all started with says
>
> MCW rank X is not bound (or bound to all available processors)
>
> 1. One thing I already mentioned is that this warning is only displayed when using --report-bindings, and it is printed instead of the actual binding. I would suggest moving the warning somewhere else (maybe the bind_downwards/upwards functions?), and instead just showing the binding in question. It might be trivial for homogeneous allocations, but is non-obvious with the type of SLURM jobs discussed. Also, seeing the warning only on the condition that --report-bindings was used, especially if the user specified a binding policy manually, is IMHO wrong - Open MPI should notify about the failure somehow instead of quietly binding to all cores.
>
> 2. Another question altogether is whether the warning should exist at all (instead of an error, as proposed by Ralph). I still get that warning, even with 1.10.1rc1, in situations in which I think it should not be displayed. In the simplest case the warning is printed when only 1 MPI task is running on a node. Obviously the statement is correct since the task is using all allocated CPUs, but it's not useful. A less trivial case is when using '--bind-to socket', and when all MPI ranks are allocated on only one socket. Again, effectively all MPI ranks use all assigned cores and the warning is, technically speaking, correct, but misleading. Instead, as discussed in 1., it would be good to see the actual binding printed instead of the warning.
>
> 3. When I specify '--map-by hwthread --bind-to core', then I get multiple MPI processes bound to the same physical core without actually specifying --oversubscribe. Just a question whether it should be like this, but maybe yes.
>
> On 10/05/2015 11:00 AM, Ralph Castain wrote:
>
> I think this is okay, in general. I would only make one change: I would only search for an alternative site if the binding policy wasn't set by the user. If the user specifies a mapping/binding pattern, then we should error out as we cannot meet it.
>
> I think that would result in non-transparent behavior in certain cases. By default mapping is done by socket, and Open MPI could behave differently if '--map-by socket' is explicitly supplied on the command line - i.e., error out in jobs like the one discussed. Is this a good idea?
>
> Introducing an error here is also a bit tricky. Consider allocating 5 MPI processes to 2 sockets. You would get an error in this type of distribution:
>
> socket 0: 2 jobs
> socket 1: 3 jobs
>
> but not in this one
>
> socket 0: 3 jobs
> socket 1: 2 jobs
>
> just because you start counting from socket 0.
>
> I did think of one alternative that might be worth considering. If we have a hetero topology, then we know that things are going to be a little unusual. In that case, we could just default to map-by core (or hwthread if --use-hwthread-cpus was given) and then things would be fine even in non-symmetric topologies. Likewise, if we have a homogeneous topology, we could just quickly check for symmetry on our base topology (the one we will use for mapping) and default to map-by core if non-symmetric.
>
> Having different default options for different cases becomes difficult to manage and understand. If I could vote, I would rather go for an informative error. Or switch to '--map-by core' as the default for all cases ;) (probably not gonna happen..)
>
> Removing support for '--map-by socket' altogether for this type of job is likely OK - I don't know. I personally like the new way it works - if there are resources, use them. But if you end up removing this possibility it would probably be good to put it into the SLURM-related docs and produce some meaningful error.
>
> Marcin
>
> I suggest it only because we otherwise wind up with some oddball hybrid mapping scheme. In the case we have here, procs would be mapped by socket except where we have an extra core, where they would look like they were mapped by core. Impossible to predict how the app will react to it.
>
> The alternative would be a more predictable pattern - would it make more sense?
>
> Ralph
>
> On Oct 5, 2015, at 1:13 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Ralph and Marcin,
>
> here is a proof of concept for a fix (assert should be replaced with proper error handling) for the v1.10 branch. If you have any chance to test it, please let me know the results.
>
> Cheers,
>
> Gilles
>
> On 10/5/2015 1:08 PM, Gilles Gouaillardet wrote:
>
> OK, i'll see what i can do :-)
>
> On 10/5/2015 12:39 PM, Ralph Castain wrote:
>
> I would consider that a bug, myself - if there is some resource available, we should use it.
>
> On Oct 4, 2015, at 5:42 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Marcin,
>
> I ran a simple test with v1.10.1rc1 under a cpuset with
> - one core (two threads 0,16) on socket 0
> - two cores (two threads each: 8,9,24,25) on socket 1
>
> $ mpirun -np 3 -bind-to core ./hello_c
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>    Bind to:     CORE
>    Node:        rapid
>    #processes:  2
>    #cpus:       1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
>
> As you already pointed out, the default mapping is by socket.
>
> So on one hand, we can consider this behavior a feature: we try to bind two processes to socket 0, so the --oversubscribe option is required (and it does what it should):
>
> $ mpirun -np 3 -bind-to core --oversubscribe -report-bindings ./hello_c
> [rapid:16278] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
> [rapid:16278] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
> [rapid:16278] MCW rank 2 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
> Hello, world, I am 1 of 3, (Open MPI v1.10.1rc1, package: Open MPI gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
> Hello, world, I am 2 of 3, (Open MPI v1.10.1rc1, package: Open MPI gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
> Hello, world, I am 0 of 3, (Open MPI v1.10.1rc1, package: Open MPI gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
>
> On the other hand, we could consider that ompi should be a bit smarter and use socket 1 for task 2, since socket 0 is fully allocated and there is room on socket 1.
>
> Ralph, any thoughts? Bug or feature?
>
> Marcin,
>
> you mentioned you had one failure with 1.10.1rc1 and -bind-to core. Could you please send the full details (script, allocation and output)? In your slurm script, you can do
>
> srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep Cpus_allowed_list /proc/self/status
>
> before invoking mpirun.
>
> Cheers,
>
> Gilles
>
> On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:
>
> Hi, all,
>
> I played a bit more and it seems that the problem results from
>
> trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()
>
> called in rmaps_base_binding.c / bind_downwards being wrong. I do not know the reason, but I think I know when the problem happens (at least on 1.10.1rc1). It seems that by default Open MPI maps by socket. The error happens when, for a given compute node, a different number of cores is used on each socket. Consider the previously studied case (the debug outputs I sent in the last post). c1-8, which was the source of the error, has 5 MPI processes assigned, and the cpuset is the following:
>
> 0, 5, 9, 13, 14, 16, 21, 25, 29, 30
>
> Cores 0, 5 are on socket 0, cores 9, 13, 14 are on socket 1. Binding progresses correctly up to and including core 13 (see the end of file out.1.10.1rc2, before the error). That is 2 cores on socket 0 and 2 cores on socket 1. The error is thrown when core 14 should be bound - an extra core on socket 1 with no corresponding core on socket 0. At that point the returned trg_obj points to the first core on the node (os_index 0, socket 0).
>
> I have submitted a few other jobs and I always had an error in such a situation.
> Moreover, if I now use --map-by core instead of socket, the error is gone, and I get my expected binding:
>
> rank 0 @ compute-1-2.local 1, 17,
> rank 1 @ compute-1-2.local 2, 18,
> rank 2 @ compute-1-2.local 3, 19,
> rank 3 @ compute-1-2.local 4, 20,
> rank 4 @ compute-1-4.local 1, 17,
> rank 5 @ compute-1-4.local 15, 31,
> rank 6 @ compute-1-8.local 0, 16,
> rank 7 @ compute-1-8.local 5, 21,
> rank 8 @ compute-1-8.local 9, 25,
> rank 9 @ compute-1-8.local 13, 29,
> rank 10 @ compute-1-8.local 14, 30,
> rank 11 @ compute-1-13.local 3, 19,
> rank 12 @ compute-1-13.local 4, 20,
> rank 13 @ compute-1-13.local 5, 21,
> rank 14 @ compute-1-13.local 6, 22,
> rank 15 @ compute-1-13.local 7, 23,
> rank 16 @ compute-1-16.local 12, 28,
> rank 17 @ compute-1-16.local 13, 29,
> rank 18 @ compute-1-16.local 14, 30,
> rank 19 @ compute-1-16.local 15, 31,
> rank 20 @ compute-1-23.local 2, 18,
> rank 29 @ compute-1-26.local 11, 27,
> rank 21 @ compute-1-23.local 3, 19,
> rank 30 @ compute-1-26.local 13, 29,
> rank 22 @ compute-1-23.local 4, 20,
> rank 31 @ compute-1-26.local 15, 31,
> rank 23 @ compute-1-23.local 8, 24,
> rank 27 @ compute-1-26.local 1, 17,
> rank 24 @ compute-1-23.local 13, 29,
> rank 28 @ compute-1-26.local 6, 22,
> rank 25 @ compute-1-23.local 14, 30,
> rank 26 @ compute-1-23.local 15, 31,
>
> Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and 1.10.1rc1. However, there is still a difference in behavior between 1.10.1rc1 and the earlier versions. In the SLURM job described in the last post, 1.10.1rc1 fails to bind in only 1 case, while the earlier versions fail in 21 out of 32 cases. You mentioned there was a bug in hwloc. Not sure if it can explain the difference in behavior.
>
> Hope this helps to nail this down.
>
> Marcin
>
> On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:
>
> Ralph,
>
> I suspect ompi tries to bind to threads outside the cpuset. This could be pretty similar to a previous issue when ompi tried to bind to cores outside the cpuset. /* when a core has more than one thread, would ompi assume all the threads are available if the core is available? */ I will investigate this from tomorrow.
>
> Cheers,
>
> Gilles
>
> On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Thanks - please go ahead and release that allocation as I'm not going to get to this immediately. I've got several hot irons in the fire right now, and I'm not sure when I'll get a chance to track this down.
>>
>> Gilles or anyone else who might have time - feel free to take a gander and see if something pops out at you.
>>
>> Ralph
>>
>> On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>
>> Done. I have compiled 1.10.0 and 1.10.1rc1 with --enable-debug and executed
>>
>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
>>
>> In case of 1.10.1rc1 I have also added :overload-allowed - output in a separate file. This option did not make much difference for 1.10.0, so I did not attach it here.
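The ./affinity test program invoked above is never posted in this thread. A minimal sketch of what such a tool presumably does - print each rank's hostname and sched_getaffinity() mask in the "rank N @ host cpus" format quoted throughout - might look like this (an assumption, not Marcin's actual code):

    /*
     * Hypothetical sketch of an "affinity"-style test: each MPI rank prints
     * the hostname and the logical CPUs in its sched_getaffinity() mask.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        char host[256];
        cpu_set_t mask;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        gethostname(host, sizeof(host));

        /* Query the CPUs this process is allowed to run on. */
        CPU_ZERO(&mask);
        sched_getaffinity(0, sizeof(mask), &mask);

        printf("rank %d @ %s ", rank, host);
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &mask))
                printf("%d, ", cpu);
        printf("\n");

        MPI_Finalize();
        return 0;
    }

Built with mpicc and launched with the same mpirun options, a program like this produces output in the format of the binding listings above; the program actually used by Marcin may of course differ.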
>>
>> The first thing I noted for 1.10.0 are lines like
>>
>> [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 IS NOT BOUND
>>
>> with an empty BITMAP.
>>
>> The SLURM environment is
>>
>> set | grep SLURM
>> SLURM_JOBID=12714491
>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
>> SLURM_JOB_ID=12714491
>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
>> SLURM_JOB_NUM_NODES=7
>> SLURM_JOB_PARTITION=normal
>> SLURM_MEM_PER_CPU=2048
>> SLURM_NNODES=7
>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
>> SLURM_NODE_ALIASES='(null)'
>> SLURM_NPROCS=32
>> SLURM_NTASKS=32
>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>> SLURM_SUBMIT_HOST=login-0-1.local
>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
>>
>> I have submitted an interactive job on screen for 120 hours now, to work with one example and not change it for every post :)
>>
>> If you need anything else, let me know. I could introduce some patches/printfs and recompile, if you need it.
>>
>> Marcin
>>
>> On 10/03/2015 07:17 PM, Ralph Castain wrote:
>>
>> Rats - just realized I have no way to test this, as none of the machines I can access are set up for cgroup-based multi-tenancy. Is this a debug version of OMPI? If not, can you rebuild OMPI with --enable-debug?
>>
>> Then please run it with --mca rmaps_base_verbose 10 and pass along the output.
>>
>> Thanks
>> Ralph
>>
>> On Oct 3, 2015, at 10:09 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> What version of slurm is this? I might try to debug it here. I'm not sure where the problem lies just yet.
>>
>> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>
>> Here is the output of lstopo. In short, (0,16) are core 0, (1,17) core 1, etc.
>>
>> Machine (64GB)
>>   NUMANode L#0 (P#0 32GB)
>>     Socket L#0 + L3 L#0 (20MB)
>>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>>         PU L#0 (P#0)
>>         PU L#1 (P#16)
>>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>>         PU L#2 (P#1)
>>         PU L#3 (P#17)
>>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>>         PU L#4 (P#2)
>>         PU L#5 (P#18)
>>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>>         PU L#6 (P#3)
>>         PU L#7 (P#19)
>>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>>         PU L#8 (P#4)
>>         PU L#9 (P#20)
>>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>>         PU L#10 (P#5)
>>         PU L#11 (P#21)
>>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>>         PU L#12 (P#6)
>>         PU L#13 (P#22)
>>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>>         PU L#14 (P#7)
>>         PU L#15 (P#23)
>>     HostBridge L#0
>>       PCIBridge
>>         PCI 8086:1521
>>           Net L#0 "eth0"
>>         PCI 8086:1521
>>           Net L#1 "eth1"
>>       PCIBridge
>>         PCI 15b3:1003
>>           Net L#2 "ib0"
>>           OpenFabrics L#3 "mlx4_0"
>>       PCIBridge
>>         PCI 102b:0532
>>       PCI 8086:1d02
>>         Block L#4 "sda"
>>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>>       PU L#16 (P#8)
>>       PU L#17 (P#24)
>>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>>       PU L#18 (P#9)
>>       PU L#19 (P#25)
>>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>>       PU L#20 (P#10)
>>       PU L#21 (P#26)
>>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>>       PU L#22 (P#11)
>>       PU L#23 (P#27)
>>     L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>>       PU L#24 (P#12)
>>       PU L#25 (P#28)
>>     L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>>       PU L#26 (P#13)
>>       PU L#27 (P#29)
>>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>>       PU L#28 (P#14)
>>       PU L#29 (P#30)
>>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>>       PU L#30 (P#15)
>>       PU L#31 (P#31)
>>
>> On 10/03/2015 05:46 PM, Ralph Castain wrote:
>>
>> Maybe I'm just misreading your HT map - that slurm nodelist syntax is a new one to me, but they tend to change things around. Could you run lstopo on one of those compute nodes and send the output?
>>
>> I'm just suspicious because I'm not seeing a clear pairing of HT numbers in your output, but HT numbering is BIOS-specific and I may just not be understanding your particular pattern. Our error message is clearly indicating that we are seeing individual HTs (and not complete cores) assigned, and I don't know the source of that confusion.
>>
>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>
>> On 10/03/2015 04:38 PM, Ralph Castain wrote:
>>
>> If mpirun isn't trying to do any binding, then you will of course get the right mapping as we'll just inherit whatever we received.
>>
>> Yes. I meant that whatever you received (what SLURM gives) is a correct cpu map and assigns _whole_ CPUs, not a single HT, to MPI processes. In the case mentioned earlier Open MPI should start 6 tasks on c1-30. If HTs were treated as separate and independent cores, sched_getaffinity of an MPI process started on c1-30 would return a map with only 6 entries. In my case it returns a map with 12 entries - 2 for each core. So one process is in fact allocated both HTs, not only one. Is what I'm saying correct?
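The core/PU pairing visible in the lstopo output above ((P#0, P#16) on Core L#0, (P#1, P#17) on Core L#1, and so on) can also be checked programmatically. The following small hwloc sketch is an illustration only, not code from this thread, and assumes the hwloc development headers are installed; it prints which core each hardware thread belongs to:

    /*
     * Hypothetical helper (not from the thread): for every hardware thread
     * (PU) in the machine, print its OS index and the logical index of the
     * core it belongs to, to confirm pairings such as (P#0, P#16) -> Core L#0.
     */
    #include <stdio.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int npus = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
        for (int i = 0; i < npus; i++) {
            hwloc_obj_t pu   = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, i);
            hwloc_obj_t core = hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_CORE, pu);
            if (core != NULL)
                printf("PU P#%u -> Core L#%u\n", pu->os_index, core->logical_index);
            else
                printf("PU P#%u -> no parent core reported\n", pu->os_index);
        }

        hwloc_topology_destroy(topo);
        return 0;
    }

Compiled with something like 'cc check_pairs.c -lhwloc' (a hypothetical file name) and run on the node above, it should report P#0 and P#16 under the same core, which is exactly the pairing Marcin describes.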
>>
>> Looking at your output, it's pretty clear that you are getting independent HTs assigned and not full cores.
>>
>> How do you mean? Is the above understanding wrong? I would expect that on c1-30 with --bind-to core Open MPI should bind to logical cores 0 and 16 (rank 0), 1 and 17 (rank 1), and so on. All those logical cores are available in the sched_getaffinity map, and there are twice as many logical cores as there are MPI processes started on the node.
>>
>> My guess is that something in slurm has changed such that it detects that HT has been enabled, and then begins treating the HTs as completely independent cpus.
>>
>> Try changing "-bind-to core" to "-bind-to hwthread -use-hwthread-cpus" and see if that works.
>>
>> I have, and the binding is wrong. For example, I got this output
>>
>> rank 0 @ compute-1-30.local 0,
>> rank 1 @ compute-1-30.local 16,
>>
>> which means that two ranks have been bound to the same physical core (logical cores 0 and 16 are two HTs of the same core). If I use --bind-to core, I get the following correct binding
>>
>> rank 0 @ compute-1-30.local 0, 16,
>>
>> The problem is that many other ranks get a bad binding with the 'rank XXX is not bound (or bound to all available processors)' warning.
>>
>> But I think I was not entirely correct saying that 1.10.1rc1 did not fix things. It still might have improved something, but not everything. Consider this job:
>>
>> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
>> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
>>
>> If I run 32 tasks as follows (with 1.10.1rc1)
>>
>> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
>>
>> I get the following error:
>>
>> --------------------------------------------------------------------------
>> A request was made to bind to that would result in binding more