Marcin,

there is no need to pursue 1.10.0 since it is known to be broken for some
scenarios.

it would really help me if you could provide the logs I requested, so I can
reproduce the issue and make sure we both talk about the same scenario.

imho, there is no legitimate reason to use -map-by hwthread -bind-to core.
we might even want to issue a warning to tell the end user he might not be
getting what he expects.

I will double check the warning about one task using all the cores when
there is only one core available.
there should be no warning at all in this case.

I will give more thought to the alternative suggested by Ralph.
imho, bad things will happen whatever policy we choose if slurm
assigns more than one job per socket: most real-world applications are
memory bound, and sharing a socket makes performance very unpredictable
anyway, regardless of the ompi binding policy.

Cheers,

Gilles

On Monday, October 5, 2015, marcin.krotkiewski <marcin.krotkiew...@gmail.com>
wrote:

>
> I have applied the patch to both 1.10.0 and 1.10.1rc1. For 1.10.0 it did
> not help - I am not sure how much (if at all) you want to pursue this.
>
> For 1.10.1rc1 I have so far been unable to reproduce any binding problems
> with jobs of up to 128 tasks. Some cosmetic suggestions follow. The warning
> this all started with says
>
> MCW rank X is not bound (or bound to all available processors)
>
> 1. One thing I already mentioned is that this warning is only displayed
> when using --report-bindings, and that it is shown instead of the actual
> binding. I would suggest moving the warning somewhere else (maybe the
> bind_downwards/upwards functions?) and instead just showing the binding in
> question (a sketch of how a process can print its own binding follows
> after this list). It might be trivial for homogeneous allocations, but it
> is non-obvious with the type of SLURM jobs discussed. Also, showing the
> warning only on the condition that --report-bindings was used, especially
> if the user specified the binding policy manually, is IMHO wrong - OpenMPI
> should notify about the failure somehow instead of quietly binding to all
> cores.
>
> 2. Another question altogether is whether the warning should exist at all
> (instead of an error, as proposed by Ralph). I still get that warning, even
> with 1.10.1rc1, in situations in which I think it should not be displayed.
> In the simplest case the warning is printed when only 1 MPI task is running
> on a node. Obviously the statement is correct, since the task is using all
> allocated CPUs, but it's not useful. A more nontrivial case is when using
> '--bind-to socket' and all MPI ranks are allocated on only one socket.
> Again, effectively all MPI ranks use all assigned cores and the warning is,
> technically speaking, correct, but misleading. Instead, as discussed in 1.,
> it would be good to see the actual binding printed.
>
> 3. When I specify '--map-by hwthread --bind-to core', I get multiple
> MPI processes bound to the same physical core without actually specifying
> --oversubscribe. Just a question whether it should be like this - but maybe
> yes.
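>
> To illustrate point 1, here is a minimal standalone sketch (not the
> OpenMPI code path, just plain hwloc) of how a process can query and print
> its own binding mask:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <hwloc.h>
>
> int main(void)
> {
>     hwloc_topology_t topo;
>     hwloc_bitmap_t set = hwloc_bitmap_alloc();
>     char *str;
>
>     hwloc_topology_init(&topo);
>     hwloc_topology_load(topo);
>     /* the mask the calling process is currently bound to */
>     hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS);
>     hwloc_bitmap_asprintf(&str, set);     /* e.g. "0x00010001" */
>     printf("bound to cpuset %s\n", str);
>     free(str);
>     hwloc_bitmap_free(set);
>     hwloc_topology_destroy(topo);
>     return 0;
> }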
>
>
> On 10/05/2015 11:00 AM, Ralph Castain wrote:
>
> I think this is okay, in general. I would only make one change: I would
> only search for an alternative site if the binding policy wasn’t set by the
> user. If the user specifies a mapping/binding pattern, then we should error
> out as we cannot meet it.
>
>
> I think that would result in non-transparent behavior in certain cases.
> By default mapping is done by socket, and OpenMPI could behave differently
> if '--map-by socket' is explicitly supplied on the command line - i.e.,
> error out in jobs like those discussed. Is this a good idea?
>
> Introducing an error here is also a bit tricky. Consider allocating 5 MPI
> processes to 2 sockets. You would get an error with this type of distribution:
>
> socket 0: 2 jobs
> socket 1: 3 jobs
>
> but not in this one
>
> socket 0: 3 jobs
> socket 1: 2 jobs
>
> just because you start counting from socket 0.
>
> I did think of one alternative that might be worth considering. If we have
> a hetero topology, then we know that things are going to be a little
> unusual. In that case, we could just default to map-by core (or hwthread if
> --use-hwthread-cpus was given) and then things would be fine even in
> non-symmetric topologies. Likewise, if we have a homogeneous topology, we
> could just quickly check for symmetry on our base topology (the one we will
> use for mapping) and default to map-by core if non-symmetric.
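>
> A hypothetical sketch of that symmetry check (assuming a hwloc topology
> "topo" whose view is already restricted to the allocation; this is not
> actual OMPI code): compare the number of available cores under each socket.
>
> static int topo_is_symmetric(hwloc_topology_t topo)
> {
>     int i, ref = -1;
>     int nsock = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET);
>     for (i = 0; i < nsock; i++) {
>         hwloc_obj_t s = hwloc_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, i);
>         int ncores = hwloc_get_nbobjs_inside_cpuset_by_type(topo, s->cpuset,
>                                                             HWLOC_OBJ_CORE);
>         if (ref < 0) ref = ncores;          /* first socket sets the reference */
>         else if (ncores != ref) return 0;   /* asymmetric: map-by core instead */
>     }
>     return 1;
> }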
>
>
> Having different default options for different cases becomes difficult to
> manage and understand. If I could vote, I would rather go for an
> informative error. Or switch to '--map-by core' as the default for all cases
> ;) (probably not gonna happen..)
>
> Removing support for '--map-by socket' altogether for this type of job is
> likely OK - I don't know. I personally like the new way it works - if there
> are resources, use them. But if you end up removing this possibility it
> would probably be good to put it into the SLURM-related docs and produce a
> meaningful error.
>
>
> Marcin
>
>
> I suggest it only because we otherwise wind up with some oddball hybrid
> mapping scheme. In the case we have here, procs would be mapped by socket
> except where we have an extra core, where they would look like they were
> mapped by core. Impossible to predict how the app will react to it.
>
>
> The alternative would be a more predictable pattern - would it make more sense?
>
> Ralph
>
>
> On Oct 5, 2015, at 1:13 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Ralph and Marcin,
>
> here is a proof of concept for a fix (the assert should be replaced with
> proper error handling)
> for the v1.10 branch.
> if you have any chance to test it, please let me know the results
>
> Cheers,
>
> Gilles
>
> On 10/5/2015 1:08 PM, Gilles Gouaillardet wrote:
>
> OK, i'll see what i can do :-)
>
> On 10/5/2015 12:39 PM, Ralph Castain wrote:
>
> I would consider that a bug, myself - if there is some resource available,
> we should use it
>
>
> On Oct 4, 2015, at 5:42 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Marcin,
>
> i ran a simple test with v1.10.1rc1 under a cpuset with
> - one core (two threads 0,16) on socket 0
> - two cores (two threads each 8,9,24,25) on socket 1
>
> $ mpirun -np 3 -bind-to core ./hello_c
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>    Bind to:     CORE
>    Node:        rapid
>    #processes:  2
>    #cpus:       1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
>
> as you already pointed, default mapping is by socket.
>
> so on one hand, we can consider this behavior a feature:
> we try to bind two processes to socket 0, so the --oversubscribe option is
> required
> (and it does what it should:
> $ mpirun -np 3 -bind-to core --oversubscribe -report-bindings ./hello_c
> [rapid:16278] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
> [BB/../../../../../../..][../../../../../../../..]
> [rapid:16278] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]:
> [../../../../../../../..][BB/../../../../../../..]
> [rapid:16278] MCW rank 2 bound to socket 0[core 0[hwt 0-1]]:
> [BB/../../../../../../..][../../../../../../../..]
> Hello, world, I am 1 of 3, (Open MPI v1.10.1rc1, package: Open MPI
> gilles@rapid Distribution, ident: 1.10.1rc1, repo rev:
> v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
> Hello, world, I am 2 of 3, (Open MPI v1.10.1rc1, package: Open MPI
> gilles@rapid Distribution, ident: 1.10.1rc1, repo rev:
> v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
> Hello, world, I am 0 of 3, (Open MPI v1.10.1rc1, package: Open MPI
> gilles@rapid Distribution, ident: 1.10.1rc1, repo rev:
> v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
>
> and on the other hand, we could consider that ompi should be a bit smarter
> and use socket 1 for task 2, since socket 0 is fully allocated and there is
> room on socket 1.
>
> Ralph, any thoughts ? bug or feature ?
>
>
> Marcin,
>
> you mentioned you had one failure with 1.10.1rc1 and -bind-to core.
> could you please send the full details (script, allocation and output)?
> in your slurm script, you can do
> srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep
> Cpus_allowed_list /proc/self/status
> before invoking mpirun
>
> Cheers,
>
> Gilles
>
> On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:
>
> Hi, all,
>
> I played a bit more and it seems that the problem results from
>
> trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()
>
> called in rmaps_base_binding.c / bind_downwards being wrong. I do not know
> the reason, but I think I know when the problem happens (at least on
> 1.10.1rc1). It seems that by default openmpi maps by socket. The error
> happens when, for a given compute node, a different number of cores is
> used on each socket. Consider the previously studied case (the debug
> outputs I sent in the last post). c1-8, which was the source of the error,
> has 5 MPI processes assigned, and the cpuset is the following:
>
> 0, 5, 9, 13, 14, 16, 21, 25, 29, 30
>
> Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1. Binding
> progresses correctly up to and including core 13 (see end of file
> out.1.10.1rc2, before the error). That is 2 cores on socket 0, and 2 cores
> on socket 1. The error is thrown when core 14 should be bound - the extra
> core on socket 1 with no corresponding core on socket 0. At that point the
> returned trg_obj points to the first core on the node (os_index 0, socket 0).
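>
> For reference, a hedged sketch of what I understand the min-bound search
> should do (this is not the actual OPAL implementation; "nbound" is a
> hypothetical per-core usage counter): walk only the cores inside the
> allowed cpuset and return the least-loaded one.
>
> static hwloc_obj_t find_min_bound_core(hwloc_topology_t topo,
>                                        hwloc_const_cpuset_t allowed,
>                                        const int *nbound)
> {
>     hwloc_obj_t best = NULL;
>     int i, n = hwloc_get_nbobjs_inside_cpuset_by_type(topo, allowed,
>                                                       HWLOC_OBJ_CORE);
>     for (i = 0; i < n; i++) {
>         hwloc_obj_t core = hwloc_get_obj_inside_cpuset_by_type(topo, allowed,
>                                                                HWLOC_OBJ_CORE, i);
>         /* keep the core with the fewest processes assigned so far */
>         if (NULL == best ||
>             nbound[core->logical_index] < nbound[best->logical_index])
>             best = core;
>     }
>     return best;   /* the buggy search instead came back with os_index 0 */
> }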
>
> I have submitted a few other jobs and I always got an error in such
> situations. Moreover, if I now use --map-by core instead of socket, the
> error is gone, and I get my expected binding:
>
> rank 0 @ compute-1-2.local  1, 17,
> rank 1 @ compute-1-2.local  2, 18,
> rank 2 @ compute-1-2.local  3, 19,
> rank 3 @ compute-1-2.local  4, 20,
> rank 4 @ compute-1-4.local  1, 17,
> rank 5 @ compute-1-4.local  15, 31,
> rank 6 @ compute-1-8.local  0, 16,
> rank 7 @ compute-1-8.local  5, 21,
> rank 8 @ compute-1-8.local  9, 25,
> rank 9 @ compute-1-8.local  13, 29,
> rank 10 @ compute-1-8.local  14, 30,
> rank 11 @ compute-1-13.local  3, 19,
> rank 12 @ compute-1-13.local  4, 20,
> rank 13 @ compute-1-13.local  5, 21,
> rank 14 @ compute-1-13.local  6, 22,
> rank 15 @ compute-1-13.local  7, 23,
> rank 16 @ compute-1-16.local  12, 28,
> rank 17 @ compute-1-16.local  13, 29,
> rank 18 @ compute-1-16.local  14, 30,
> rank 19 @ compute-1-16.local  15, 31,
> rank 20 @ compute-1-23.local  2, 18,
> rank 29 @ compute-1-26.local  11, 27,
> rank 21 @ compute-1-23.local  3, 19,
> rank 30 @ compute-1-26.local  13, 29,
> rank 22 @ compute-1-23.local  4, 20,
> rank 31 @ compute-1-26.local  15, 31,
> rank 23 @ compute-1-23.local  8, 24,
> rank 27 @ compute-1-26.local  1, 17,
> rank 24 @ compute-1-23.local  13, 29,
> rank 28 @ compute-1-26.local  6, 22,
> rank 25 @ compute-1-23.local  14, 30,
> rank 26 @ compute-1-23.local  15, 31,
>
> Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and 1.10.1rc1.
> However, there is still a difference in behavior between 1.10.1rc1 and
> earlier versions. In the SLURM job described in last post, 1.10.1rc1 fails
> to bind only in 1 case, while the earlier versions fail in 21 out of 32
> cases. You mentioned there was a bug in hwloc. Not sure if it can explain
> the difference in behavior.
>
> Hope this helps to nail this down.
>
> Marcin
>
>
> On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:
>
> Ralph,
>
> I suspect ompi tries to bind to threads outside the cpuset.
> this could be pretty similar to a previous issue, when ompi tried to bind
> to cores outside the cpuset.
> /* when a core has more than one thread, does ompi assume all the threads
> are available if the core is available ? */
> I will investigate this starting tomorrow
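>
> as a sketch, the check would be whether every hwthread of a core lies
> inside the allowed cpuset (not ompi code, just the hwloc call I have in
> mind):
>
> static int core_fully_available(hwloc_obj_t core, hwloc_const_cpuset_t allowed)
> {
>     /* 1 iff all of the core's PUs are contained in the allowed set */
>     return hwloc_bitmap_isincluded(core->cpuset, allowed);
> }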
>
> Cheers,
>
> Gilles
>
> On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Thanks - please go ahead and release that allocation as I’m not going to
>> get to this immediately. I’ve got several hot irons in the fire right now,
>> and I’m not sure when I’ll get a chance to track this down.
>>
>> Gilles or anyone else who might have time - feel free to take a gander
>> and see if something pops out at you.
>>
>> Ralph
>>
>>
>> On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski <
>> marcin.krotkiew...@gmail.com> wrote:
>>
>>
>> Done. I have compiled 1.10.0 and 1.10.1rc1 with --enable-debug and executed
>>
>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings
>> --bind-to core -np 32 ./affinity
>>
>> In the case of 1.10.1rc1 I have also added :overload-allowed - output in a
>> separate file. This option did not make much difference for 1.10.0, so I
>> did not attach it here.
>>
>> The first thing I noted for 1.10.0 is lines like
>>
>> [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 IS NOT
>> BOUND
>>
>> with an empty BITMAP.
>>
>> The SLURM environment is
>>
>> set | grep SLURM
>> SLURM_JOBID=12714491
>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
>> SLURM_JOB_ID=12714491
>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
>> SLURM_JOB_NUM_NODES=7
>> SLURM_JOB_PARTITION=normal
>> SLURM_MEM_PER_CPU=2048
>> SLURM_NNODES=7
>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
>> SLURM_NODE_ALIASES='(null)'
>> SLURM_NPROCS=32
>> SLURM_NTASKS=32
>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>> SLURM_SUBMIT_HOST=login-0-1.local
>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
>>
>> I have submitted an interactive job on screen for 120 hours now to work
>> with one example, and not change it for every post :)
>>
>> If you need anything else, let me know. I could introduce some
>> patch/printfs and recompile, if you need it.
>>
>> Marcin
>>
>>
>>
>> On 10/03/2015 07:17 PM, Ralph Castain wrote:
>>
>> Rats - just realized I have no way to test this, as none of the machines I
>> can access are set up for cgroup-based multi-tenancy. Is this a debug
>> version of OMPI? If not, can you rebuild OMPI with --enable-debug?
>>
>> Then please run it with --mca rmaps_base_verbose 10 and pass along the
>> output.
>>
>> Thanks
>> Ralph
>>
>>
>> On Oct 3, 2015, at 10:09 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> What version of slurm is this? I might try to debug it here. I’m not sure
>> where the problem lies just yet.
>>
>>
>> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski <
>> marcin.krotkiew...@gmail.com> wrote:
>>
>> Here is the output of lstopo. In short, (0,16) is core 0, (1,17) is core
>> 1, etc.
>>
>> Machine (64GB)
>>   NUMANode L#0 (P#0 32GB)
>>     Socket L#0 + L3 L#0 (20MB)
>>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>>         PU L#0 (P#0)
>>         PU L#1 (P#16)
>>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>>         PU L#2 (P#1)
>>         PU L#3 (P#17)
>>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>>         PU L#4 (P#2)
>>         PU L#5 (P#18)
>>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>>         PU L#6 (P#3)
>>         PU L#7 (P#19)
>>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>>         PU L#8 (P#4)
>>         PU L#9 (P#20)
>>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>>         PU L#10 (P#5)
>>         PU L#11 (P#21)
>>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>>         PU L#12 (P#6)
>>         PU L#13 (P#22)
>>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>>         PU L#14 (P#7)
>>         PU L#15 (P#23)
>>     HostBridge L#0
>>       PCIBridge
>>         PCI 8086:1521
>>           Net L#0 "eth0"
>>         PCI 8086:1521
>>           Net L#1 "eth1"
>>       PCIBridge
>>         PCI 15b3:1003
>>           Net L#2 "ib0"
>>           OpenFabrics L#3 "mlx4_0"
>>       PCIBridge
>>         PCI 102b:0532
>>       PCI 8086:1d02
>>         Block L#4 "sda"
>>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>>       PU L#16 (P#8)
>>       PU L#17 (P#24)
>>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>>       PU L#18 (P#9)
>>       PU L#19 (P#25)
>>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>>       PU L#20 (P#10)
>>       PU L#21 (P#26)
>>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>>       PU L#22 (P#11)
>>       PU L#23 (P#27)
>>     L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>>       PU L#24 (P#12)
>>       PU L#25 (P#28)
>>     L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>>       PU L#26 (P#13)
>>       PU L#27 (P#29)
>>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>>       PU L#28 (P#14)
>>       PU L#29 (P#30)
>>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>>       PU L#30 (P#15)
>>       PU L#31 (P#31)
>>
>>
>>
>> On 10/03/2015 05:46 PM, Ralph Castain wrote:
>>
>> Maybe I’m just misreading your HT map - that slurm nodelist syntax is a
>> new one to me, but they tend to change things around. Could you run lstopo
>> on one of those compute nodes and send the output?
>>
>> I’m just suspicious because I’m not seeing a clear pairing of HT numbers
>> in your output, but HT numbering is BIOS-specific and I may just not be
>> understanding your particular pattern. Our error message is clearly
>> indicating that we are seeing individual HTs (and not complete cores)
>> assigned, and I don’t know the source of that confusion.
>>
>>
>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski <
>> marcin.krotkiew...@gmail.com> wrote:
>>
>>
>> On 10/03/2015 04:38 PM, Ralph Castain wrote:
>>
>> If mpirun isn’t trying to do any binding, then you will of course get the
>> right mapping as we’ll just inherit whatever we received.
>>
>> Yes. I meant that whatever you received (what SLURM gives) is a correct
>> cpu map and assigns _whole_ CPUs, not single HTs, to MPI processes. In the
>> case mentioned earlier openmpi should start 6 tasks on c1-30. If HTs were
>> treated as separate and independent cores, sched_getaffinity of an MPI
>> process started on c1-30 would return a map with 6 entries only. In my case
>> it returns a map with 12 entries - 2 for each core. So each process is in
>> fact allocated both HTs, not only one. Is what I'm saying correct?
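>>
>> As a sketch of that check (I assume the ./affinity test program does
>> something similar per rank; this is not its actual source), count and
>> list the entries returned by sched_getaffinity():
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <sched.h>
>>
>> int main(void)
>> {
>>     cpu_set_t mask;
>>     sched_getaffinity(0, sizeof(mask), &mask);  /* mask of the calling process */
>>     printf("allowed cpus: %d :", CPU_COUNT(&mask));
>>     for (int i = 0; i < CPU_SETSIZE; i++)
>>         if (CPU_ISSET(i, &mask))
>>             printf(" %d", i);
>>     printf("\n");   /* on c1-30 this lists 12 entries, not 6 */
>>     return 0;
>> }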
>>
>> Looking at your output, it’s pretty clear that you are getting
>> independent HTs assigned and not full cores.
>>
>> How do you mean? Is the above understanding wrong? I would expect that on
>> c1-30 with --bind-to core openmpi should bind to logical cores 0 and 16
>> (rank 0), 1 and 17 (rank 2) and so on. All those logical cores are
>> available in sched_getaffinity map, and there is twice as many logical
>> cores as there are MPI processes started on the node.
>>
>> My guess is that something in slurm has changed such that it detects that
>> HT has been enabled, and then begins treating the HTs as completely
>> independent cpus.
>>
>> Try changing “-bind-to core” to “-bind-to hwthread -use-hwthread-cpus”
>> and see if that works
>>
>> I have, and the binding is wrong. For example, I got this output
>>
>> rank 0 @ compute-1-30.local  0,
>> rank 1 @ compute-1-30.local  16,
>>
>> Which means that two ranks have been bound to the same physical core
>> (logical cores 0 and 16 are two HTs of the same core). If I use --bind-to
>> core, I get the following correct binding
>>
>> rank 0 @ compute-1-30.local  0, 16,
>>
>> The problem is that many other ranks get bad binding, with the 'rank XXX
>> is not bound (or bound to all available processors)' warning.
>>
>> But I think I was not entirely correct saying that 1.10.1rc1 did not fix
>> things. It still might have improved something, but not everything.
>> Consider this job:
>>
>> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
>> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
>>
>> If I run 32 tasks as follows (with 1.10.1rc1)
>>
>> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
>>
>> I get the following error:
>>
>> --------------------------------------------------------------------------
>> A request was made to bind to that would result in binding more
>>
>>
