I have applied the patch to both 1.10.0 and 1.10.1rc1. For 1.10.0 it did not help - I am not sure how much (if at all) you want to pursue this.

For 1.10.1rc1 I have so far been unable to reproduce any binding problems with jobs of up to 128 tasks. A few cosmetic suggestions. The warning it all started with says

MCW rank X is not bound (or bound to all available processors)

1. One thing I already mentioned is that this warning is only displayed when using --report-bindings, and it is shown instead of the actual binding. I would suggest moving the warning somewhere else (maybe to the bind_downwards/upwards functions?) and simply showing the binding in question. It might be trivial for homogeneous allocations, but it is non-obvious with the type of SLURM jobs discussed here. Also, showing the warning only when --report-bindings is used, especially if the user specified the binding policy manually, is IMHO wrong - Open MPI should notify the user about the failure somehow instead of quietly binding to all cores.

2. Another question altogether is whether the warning should exist at all (instead of an error, as Ralph proposed). I still get that warning, even with 1.10.1rc1, in situations in which I think it should not be displayed. In the simplest case the warning is printed when only 1 MPI task is running on a node. Obviously the statement is correct, since the task is using all allocated CPUs, but it's not useful. A more nontrivial case is '--bind-to socket' when all MPI ranks are allocated on only one socket. Again, effectively all MPI ranks use all assigned cores, and the warning is technically correct, but misleading. As discussed in point 1, it would be better to print the actual binding instead of the warning (see the sketch after this list).

3. When I specify '--map-by hwthread --bind-to core', I get multiple MPI processes bound to the same physical core without having specified --oversubscribe. Just a question whether this is intended behavior - maybe it is.
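
For reference, here is a minimal sketch of how a process could print its own binding mask with the hwloc C API - the kind of information --report-bindings would show. This is only an illustration using standard hwloc calls, not how Open MPI implements --report-bindings internally:

/* Hedged sketch: print the calling process's CPU binding via hwloc.
 * Illustration only - not Open MPI's --report-bindings code path. */
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t set;
    char *str = NULL;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    set = hwloc_bitmap_alloc();
    /* Query the binding of the calling process. */
    if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) == 0) {
        hwloc_bitmap_asprintf(&str, set);   /* e.g. "0x00010001" */
        printf("bound to PUs: %s\n", str);
        free(str);
    } else {
        printf("binding could not be determined\n");
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}

Run under mpirun (and linked with -lhwloc), each rank would print the PU mask it actually received, whether or not any warning was emitted.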


On 10/05/2015 11:00 AM, Ralph Castain wrote:
I think this is okay, in general. I would only make one change: I would only search for an alternative site if the binding policy wasn’t set by the user. If the user specifies a mapping/binding pattern, then we should error out as we cannot meet it.


I think that would result in non-transparent behavior in certain cases. By default, mapping is done by socket, and Open MPI would then behave differently if '--map-by socket' were explicitly supplied on the command line - i.e., it would error out in jobs like the ones discussed. Is this a good idea?

Introducing an error here is also a bit tricky. Consider allocating 5 MPI processes to 2 sockets. You would get an error with this type of distribution:

socket 0: 2 jobs
socket 1: 3 jobs

but not in this one

socket 0: 3 jobs
socket 1: 2 jobs

just because you start counting from socket 0.
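
To make the asymmetry concrete, here is a tiny hypothetical sketch of round-robin by-socket placement with an overload check (an illustration of the behavior described above, not the actual rmaps code; the core counts are made up):

/* Hypothetical sketch of by-socket round-robin placement with an
 * overload check - NOT the actual Open MPI rmaps code. */
#include <stdio.h>

int main(void)
{
    int cores[2]  = {2, 3};   /* free cores per socket: this layout errors out  */
    /* int cores[2] = {3, 2};    same total, but this layout fits without error */
    int placed[2] = {0, 0};
    int nprocs = 5;

    for (int p = 0; p < nprocs; p++) {
        int s = p % 2;        /* counting always starts from socket 0 */
        if (++placed[s] > cores[s]) {
            printf("proc %d: socket %d overloaded (%d procs, %d cores)\n",
                   p, s, placed[s], cores[s]);
            return 1;
        }
    }
    printf("all %d procs placed: %d on socket 0, %d on socket 1\n",
           nprocs, placed[0], placed[1]);
    return 0;
}

With {2,3} free cores the fifth process lands on the already-full socket 0 and triggers the error; with {3,2} the same round robin happens to fit, even though both allocations have room for 5 processes in total.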

I did think of one alternative that might be worth considering. If we have a hetero topology, then we know that things are going to be a little unusual. In that case, we could just default to map-by core (or hwthread if --use-hwthread-cpus was given) and then things would be fine even in non-symmetric topologies. Likewise, if we have a homogeneous topology, we could just quickly check for symmetry on our base topology (the one we will use for mapping) and default to map-by core if non-symmetric.

Having different default options for different cases becomes difficult to manage and understand. If I could vote, I would rather go for an informative error. Or switch to '--map-by core' as the default for all cases ;) (probably not going to happen...)

Removing support for '--map-by socket' altogether for this type of job is probably OK - I don't know. I personally like the new way it works: if there are resources, use them. But if you end up removing this possibility, it would be good to document it in the SLURM-related docs and produce a meaningful error.


Marcin


I suggest it only because we otherwise wind up with some oddball hybrid mapping scheme. In the case we have here, procs would be mapped by socket except where we have an extra core, where they would look like they were mapped by core. Impossible to predict how the app will react to it.





The alternative would be a more predictable pattern - would that make more sense?

Ralph


On Oct 5, 2015, at 1:13 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Ralph and Marcin,

Here is a proof of concept for a fix (the assert should be replaced with proper error handling)
for the v1.10 branch.
If you have a chance to test it, please let me know the results.

Cheers,

Gilles

On 10/5/2015 1:08 PM, Gilles Gouaillardet wrote:
OK, I'll see what I can do :-)

On 10/5/2015 12:39 PM, Ralph Castain wrote:
I would consider that a bug, myself - if there is some resource available, we should use it


On Oct 4, 2015, at 5:42 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Marcin,

I ran a simple test with v1.10.1rc1 under a cpuset with
- one core (two threads: 0,16) on socket 0
- two cores (two threads each: 8,9,24,25) on socket 1

$ mpirun -np 3 -bind-to core ./hello_c
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        rapid
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

As you already pointed out, default mapping is by socket.

So on one hand, we can consider this behavior a feature:
we try to bind two processes to socket 0, so the --oversubscribe option is required
(and it does what it should):
$ mpirun -np 3 -bind-to core --oversubscribe -report-bindings ./hello_c
[rapid:16278] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[rapid:16278] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[rapid:16278] MCW rank 2 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
Hello, world, I am 1 of 3, (Open MPI v1.10.1rc1, package: Open MPI gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
Hello, world, I am 2 of 3, (Open MPI v1.10.1rc1, package: Open MPI gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
Hello, world, I am 0 of 3, (Open MPI v1.10.1rc1, package: Open MPI gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: v1.10.0-84-g15ae63f, Oct 03, 2015, 128)

And on the other hand, we could argue ompi should be a bit smarter and use socket 1 for task 2, since socket 0 is fully allocated and there is room on socket 1.

Ralph, any thoughts? Bug or feature?


Marcin,

You mentioned you had one failure with 1.10.1rc1 and -bind-to core.
Could you please send the full details (script, allocation and output)?
In your slurm script, you can do
srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep Cpus_allowed_list /proc/self/status
before invoking mpirun.

Cheers,

Gilles

On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:
Hi, all,

I played a bit more and it seems that the problem results from

trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()

called in rmaps_base_binding.c / bind_downwards being wrong. I do not know the reason, but I think I know when the problem happens (at least on 1.10.1rc1). It seems that by default openmpi maps by socket. The error happens when, for a given compute node, a different number of cores is used on each socket. Consider the previously studied case (the debug outputs I sent in the last post). c1-8, which was the source of the error, has 5 MPI processes assigned, and the cpuset is the following:

0, 5, 9, 13, 14, 16, 21, 25, 29, 30

Cores 0 and 5 are on socket 0; cores 9, 13 and 14 are on socket 1. Binding progresses correctly up to and including core 13 (see the end of file out.1.10.1rc2, before the error). That is 2 cores on socket 0 and 2 cores on socket 1. The error is thrown when core 14 should be bound - the extra core on socket 1 with no corresponding core on socket 0. At that point the returned trg_obj points to the first core on the node (os_index 0, socket 0).

I have submitted a few other jobs and I always got an error in this situation. Moreover, if I now use --map-by core instead of socket, the error is gone and I get my expected binding (a minimal sketch of the affinity test program itself follows the listing):

rank 0 @ compute-1-2.local  1, 17,
rank 1 @ compute-1-2.local  2, 18,
rank 2 @ compute-1-2.local  3, 19,
rank 3 @ compute-1-2.local  4, 20,
rank 4 @ compute-1-4.local  1, 17,
rank 5 @ compute-1-4.local  15, 31,
rank 6 @ compute-1-8.local  0, 16,
rank 7 @ compute-1-8.local  5, 21,
rank 8 @ compute-1-8.local  9, 25,
rank 9 @ compute-1-8.local  13, 29,
rank 10 @ compute-1-8.local  14, 30,
rank 11 @ compute-1-13.local  3, 19,
rank 12 @ compute-1-13.local  4, 20,
rank 13 @ compute-1-13.local  5, 21,
rank 14 @ compute-1-13.local  6, 22,
rank 15 @ compute-1-13.local  7, 23,
rank 16 @ compute-1-16.local  12, 28,
rank 17 @ compute-1-16.local  13, 29,
rank 18 @ compute-1-16.local  14, 30,
rank 19 @ compute-1-16.local  15, 31,
rank 20 @ compute-1-23.local  2, 18,
rank 29 @ compute-1-26.local  11, 27,
rank 21 @ compute-1-23.local  3, 19,
rank 30 @ compute-1-26.local  13, 29,
rank 22 @ compute-1-23.local  4, 20,
rank 31 @ compute-1-26.local  15, 31,
rank 23 @ compute-1-23.local  8, 24,
rank 27 @ compute-1-26.local  1, 17,
rank 24 @ compute-1-23.local  13, 29,
rank 28 @ compute-1-26.local  6, 22,
rank 25 @ compute-1-23.local  14, 30,
rank 26 @ compute-1-23.local  15, 31,
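
For completeness, a minimal sketch of an affinity-reporting test program along the lines of the ./affinity used above (a hypothetical reconstruction based on the output format, assuming Linux and sched_getaffinity; the real program may differ):

/* Hypothetical reconstruction of an affinity test program: each rank
 * prints the logical CPUs it is allowed to run on. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = calling process */

    printf("rank %d @ %s  ", rank, host);
    for (int c = 0; c < CPU_SETSIZE; c++)
        if (CPU_ISSET(c, &mask))
            printf("%d, ", c);
    printf("\n");

    MPI_Finalize();
    return 0;
}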

Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and 1.10.1rc1. However, there is still a difference in behavior between 1.10.1rc1 and the earlier versions. In the SLURM job described in the last post, 1.10.1rc1 fails to bind in only 1 case, while the earlier versions fail in 21 out of 32 cases. You mentioned there was a bug in hwloc; I am not sure whether that can explain the difference in behavior.

Hope this helps to nail this down.

Marcin




On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:
Ralph,

I suspect ompi tries to bind to threads outside the cpuset.
This could be pretty similar to a previous issue where ompi tried to bind to cores outside the cpuset. /* When a core has more than one thread, does ompi assume all the threads are available if the core is available? */
I will investigate this starting tomorrow.
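
A quick way to check would be to walk the cores hwloc sees inside the cpuset and count their allowed hardware threads - a hedged diagnostic sketch using standard hwloc calls, not Open MPI code:

/* Hedged diagnostic sketch: for every core visible inside the current
 * cpuset, report how many of its hardware threads are actually allowed. */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);   /* restricted to the allowed cpuset by default */

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (int i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        int npus = hwloc_bitmap_weight(core->cpuset);   /* allowed PUs in this core */
        printf("core os_index %u: %d hwthread(s) in the cpuset\n",
               core->os_index, npus);
    }

    hwloc_topology_destroy(topo);
    return 0;
}

If some cores report only 1 hwthread in the cpuset while HT is enabled on the node, that would point in the direction suspected above.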

Cheers,

Gilles

On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org> wrote:

    Thanks - please go ahead and release that allocation as I’m
    not going to get to this immediately. I’ve got several hot
    irons in the fire right now, and I’m not sure when I’ll get
    a chance to track this down.

    Gilles or anyone else who might have time - feel free to
    take a gander and see if something pops out at you.

    Ralph


    On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski
    <marcin.krotkiew...@gmail.com> wrote:


    Done. I have compiled 1.10.0 and 1.10.1rc1 with
    --enable-debug and executed

    mpirun --mca rmaps_base_verbose 10 --hetero-nodes
    --report-bindings --bind-to core -np 32 ./affinity

    In the case of 1.10.1rc1 I have also added :overload-allowed -
    the output is in a separate file. This option did not make much
    difference for 1.10.0, so I did not attach it here.

    The first thing I noted for 1.10.0 is lines like

    [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
    [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27]
    BITMAP
    [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27]
    ON c1-26 IS NOT BOUND

    with an empty BITMAP.

    The SLURM environment is

    set | grep SLURM
    SLURM_JOBID=12714491
    SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
    SLURM_JOB_ID=12714491
    SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
    SLURM_JOB_NUM_NODES=7
    SLURM_JOB_PARTITION=normal
    SLURM_MEM_PER_CPU=2048
    SLURM_NNODES=7
    SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
    SLURM_NODE_ALIASES='(null)'
    SLURM_NPROCS=32
    SLURM_NTASKS=32
    SLURM_SUBMIT_DIR=/cluster/home/marcink
    SLURM_SUBMIT_HOST=login-0-1.local
    SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'

    I have submitted an interactive job in screen for 120 hours
    now, so I can work with one example and not change it for every
    post :)

    If you need anything else, let me know. I could introduce
    some patch/printfs and recompile, if you need it.

    Marcin



    On 10/03/2015 07:17 PM, Ralph Castain wrote:
    Rats - just realized I have no way to test this, as none of
    the machines I can access are set up for cgroup-based
    multi-tenancy. Is this a debug version of OMPI? If not, can
    you rebuild OMPI with --enable-debug?

    Then please run it with --mca rmaps_base_verbose 10 and
    pass along the output.

    Thanks
    Ralph


    On Oct 3, 2015, at 10:09 AM, Ralph Castain
    <r...@open-mpi.org> wrote:

    What version of slurm is this? I might try to debug it
    here. I’m not sure where the problem lies just yet.


    On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski
    <marcin.krotkiew...@gmail.com> wrote:

    Here is the output of lstopo. In short, (0,16) are core
    0, (1,17) are core 1, etc.

    Machine (64GB)
      NUMANode L#0 (P#0 32GB)
        Socket L#0 + L3 L#0 (20MB)
          L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) +
    Core L#0
            PU L#0 (P#0)
            PU L#1 (P#16)
          L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) +
    Core L#1
            PU L#2 (P#1)
            PU L#3 (P#17)
          L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) +
    Core L#2
            PU L#4 (P#2)
            PU L#5 (P#18)
          L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) +
    Core L#3
            PU L#6 (P#3)
            PU L#7 (P#19)
          L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) +
    Core L#4
            PU L#8 (P#4)
            PU L#9 (P#20)
          L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) +
    Core L#5
            PU L#10 (P#5)
            PU L#11 (P#21)
          L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) +
    Core L#6
            PU L#12 (P#6)
            PU L#13 (P#22)
          L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) +
    Core L#7
            PU L#14 (P#7)
            PU L#15 (P#23)
        HostBridge L#0
    PCIBridge
            PCI 8086:1521
              Net L#0 "eth0"
            PCI 8086:1521
              Net L#1 "eth1"
    PCIBridge
            PCI 15b3:1003
              Net L#2 "ib0"
    OpenFabrics L#3 "mlx4_0"
    PCIBridge
            PCI 102b:0532
          PCI 8086:1d02
            Block L#4 "sda"
      NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
        L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) +
    Core L#8
          PU L#16 (P#8)
          PU L#17 (P#24)
        L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) +
    Core L#9
          PU L#18 (P#9)
          PU L#19 (P#25)
        L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB)
    + Core L#10
          PU L#20 (P#10)
          PU L#21 (P#26)
        L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB)
    + Core L#11
          PU L#22 (P#11)
          PU L#23 (P#27)
        L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB)
    + Core L#12
          PU L#24 (P#12)
          PU L#25 (P#28)
        L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB)
    + Core L#13
          PU L#26 (P#13)
          PU L#27 (P#29)
        L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB)
    + Core L#14
          PU L#28 (P#14)
          PU L#29 (P#30)
        L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB)
    + Core L#15
          PU L#30 (P#15)
          PU L#31 (P#31)



    On 10/03/2015 05:46 PM, Ralph Castain wrote:
    Maybe I’m just misreading your HT map - that slurm
    nodelist syntax is a new one to me, but they tend to
    change things around. Could you run lstopo on one of
    those compute nodes and send the output?

    I’m just suspicious because I’m not seeing a clear
    pairing of HT numbers in your output, but HT numbering
    is BIOS-specific and I may just not be understanding
    your particular pattern. Our error message is clearly
    indicating that we are seeing individual HTs (and not
    complete cores) assigned, and I don’t know the source
    of that confusion.


    On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski
    <marcin.krotkiew...@gmail.com> wrote:


    On 10/03/2015 04:38 PM, Ralph Castain wrote:
    If mpirun isn’t trying to do any binding, then you
    will of course get the right mapping as we’ll just
    inherit whatever we received.
    Yes. I meant that whatever you received (what SLURM
    gives) is a correct cpu map and assigns _whole_ CPUs,
    not single HTs, to MPI processes. In the case
    mentioned earlier openmpi should start 6 tasks on
    c1-30. If HTs were treated as separate and
    independent cores, sched_getaffinity of an MPI process
    started on c1-30 would return a map with only 6
    entries. In my case it returns a map with 12 entries - 2
    for each core. So one process is in fact allocated
    both HTs, not only one. Is what I'm saying correct?

    Looking at your output, it’s pretty clear that you
    are getting independent HTs assigned and not full cores.
    How do you mean? Is the above understanding wrong? I
    would expect that on c1-30 with --bind-to core openmpi
    should bind to logical cores 0 and 16 (rank 0), 1 and
    17 (rank 1) and so on. All those logical cores are
    available in the sched_getaffinity map, and there are twice
    as many logical cores as there are MPI processes
    started on the node.

    My guess is that something in slurm has changed such
    that it detects that HT has been enabled, and then
    begins treating the HTs as completely independent cpus.

    Try changing “-bind-to core” to “-bind-to hwthread
     -use-hwthread-cpus” and see if that works

    I have, and the binding is wrong. For example, I got
    this output:

    rank 0 @ compute-1-30.local 0,
    rank 1 @ compute-1-30.local 16,

    Which means that two ranks have been bound to the same
    physical core (logical cores 0 and 16 are two HTs of
    the same core). If I use --bind-to core, I get the
    following correct binding

    rank 0 @ compute-1-30.local 0, 16,

    The problem is that many other ranks get a bad binding, with
    the 'rank XXX is not bound (or bound to all available
    processors)' warning.

    But I think I was not entirely correct saying that
    1.10.1rc1 did not fix things. It still might have
    improved something, but not everything. Consider this job:

    SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
    SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'

    If I run 32 tasks as follows (with 1.10.1rc1)

    mpirun --hetero-nodes --report-bindings --bind-to core
    -np 32 ./affinity

    I get the following error:

    --------------------------------------------------------------------------
    A request was made to bind to that would result in
    binding more
    processes than cpus on a resource:

       Bind to:     CORE
       Node:        c9-31
       #processes:  2
       #cpus:       1

    You can override this protection by adding the
    "overload-allowed"
    option to your binding directive.
    --------------------------------------------------------------------------


    If I now use --bind-to core:overload-allowed, then
    openmpi starts and _most_ of the ranks are bound
    correctly (i.e., the map contains two logical cores in ALL
    cases), except on this node, which required the overload flag:

    rank 15 @ compute-9-31.local 1, 17,
    rank 16 @ compute-9-31.local 11, 27,
    rank 17 @ compute-9-31.local 2, 18,
    rank 18 @ compute-9-31.local 12, 28,
    rank 19 @ compute-9-31.local 1, 17,

    Note that the pair (1,17) is used twice. The original
    SLURM-delivered map (no binding) on this node is:

    rank 15 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18,
    27, 28, 29,
    rank 16 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18,
    27, 28, 29,
    rank 17 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18,
    27, 28, 29,
    rank 18 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18,
    27, 28, 29,
    rank 19 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18,
    27, 28, 29,

    Why does openmpi use core (1,17) twice instead of
    using core (13,29)? Clearly, the original
    SLURM-delivered map includes 5 physical cores, enough for 5
    MPI processes.

    Cheers,

    Marcin



    On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski
    <marcin.krotkiew...@gmail.com> wrote:


    On 10/03/2015 01:06 PM, Ralph Castain wrote:
    Thanks Marcin. Looking at this, I’m guessing that
    Slurm may be treating HTs as “cores” - i.e., as
    independent cpus. Any chance that is true?
    Not to the best of my knowledge, and at least not
    intentionally. SLURM starts as many processes as
    there are physical cores, not threads. To verify
    this, consider this test case:














<unbalanced.patch>


