Gilles,

Yes, binding seemed to work fine with the patched 1.10.1rc1 - thank you. I am eagerly waiting for the other patches; let me know and I will test them later this week.

Marcin



On 10/06/2015 12:09 PM, Gilles Gouaillardet wrote:
Marcin,

My understanding is that in this case the patched v1.10.1rc1 is working just fine.
Am I right?

I have prepared two patches:
one to remove the warning when binding to one core if only one core is available, and another to add a warning if the user asks for a binding policy that makes no sense with the requested mapping policy.

I will hopefully finalize them tomorrow.

Cheers,

Gilles

On Tuesday, October 6, 2015, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:

    Hi, Gilles
    you mentioned you had one failure with 1.10.1rc1 and -bind-to core
    could you please send the full details (script, allocation and output)
    in your slurm script, you can do
    srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep Cpus_allowed_list /proc/self/status
    before invoking mpirun

    It was an interactive job allocated with

    salloc --account=staff --ntasks=32 --mem-per-cpu=2G --time=120:0:0

    The slurm environment is the following

    SLURM_JOBID=12714491
    SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
    SLURM_JOB_ID=12714491
    SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
    SLURM_JOB_NUM_NODES=7
    SLURM_JOB_PARTITION=normal
    SLURM_MEM_PER_CPU=2048
    SLURM_NNODES=7
    SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
    SLURM_NODE_ALIASES='(null)'
    SLURM_NPROCS=32
    SLURM_NTASKS=32
    SLURM_SUBMIT_DIR=/cluster/home/marcink
    SLURM_SUBMIT_HOST=login-0-1.local
    SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'

    The output of the command you asked for is

    0: c1-2.local  Cpus_allowed_list:        1-4,17-20
    1: c1-4.local  Cpus_allowed_list:        1,15,17,31
    2: c1-8.local  Cpus_allowed_list: 0,5,9,13-14,16,21,25,29-30
    3: c1-13.local  Cpus_allowed_list:       3-7,19-23
    4: c1-16.local  Cpus_allowed_list:       12-15,28-31
    5: c1-23.local  Cpus_allowed_list: 2-4,8,13-15,18-20,24,29-31
    6: c1-26.local  Cpus_allowed_list: 1,6,11,13,15,17,22,27,29,31

    Running with the command

    mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to core
    --report-bindings --map-by socket -np 32 ./affinity
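
    (For reference: ./affinity is essentially a per-rank dump of the CPUs a
    process may run on. Its source was not posted, so the following is only a
    minimal sketch of an equivalent reporter, assuming it simply prints each
    rank's sched_getaffinity mask in the "rank N @ host cpulist" format quoted
    further down in this thread:)

    /*
     * Minimal affinity reporter (a sketch only - not the actual ./affinity
     * source): print the CPUs each MPI rank is allowed to run on.
     */
    #define _GNU_SOURCE
    #include <mpi.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char host[256];
        gethostname(host, sizeof(host));

        cpu_set_t mask;
        CPU_ZERO(&mask);
        sched_getaffinity(0, sizeof(mask), &mask);   /* pid 0 = this process */

        char cpus[8192] = "";
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &mask))
                snprintf(cpus + strlen(cpus), sizeof(cpus) - strlen(cpus),
                         "%d, ", cpu);

        printf("rank %d @ %s  %s\n", rank, host, cpus);

        MPI_Finalize();
        return 0;
    }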

    I have attached two output files: one for the original 1.10.1rc1
    and one for the patched version.

    When I said 'failed in one case' I was not precise. I got an error
    on node c1-8, which was the first one to have a different number of
    MPI processes on the two sockets. It would also have failed on some
    later nodes; we just never got there because of the error.

    Let me know if you need more.

    Marcin

    Cheers,

    Gilles

    On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:
    Hi, all,

    I played a bit more and it seems that the problem results from

    trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()

    called in rmaps_base_binding.c / bind_downwards returning a wrong
    object. I do not know the reason, but I think I know when the
    problem happens (at least on 1.10.1rc1). It seems that by default
    openmpi maps by socket. The error happens when, for a given
    compute node, a different number of cores is used on each socket.
    Consider the previously studied case (the debug outputs I sent in
    my last post). c1-8, which was the source of the error, has 5 MPI
    processes assigned, and the cpuset is the following:

    0, 5, 9, 13, 14, 16, 21, 25, 29, 30

    Cores 0 and 5 are on socket 0; cores 9, 13 and 14 are on socket 1.
    Binding progresses correctly up to and including core 13 (see the
    end of file out.1.10.1rc2, before the error). That is 2 cores on
    socket 0 and 2 cores on socket 1. The error is thrown when core 14
    should be bound - an extra core on socket 1 with no corresponding
    core on socket 0. At that point the returned trg_obj points to
    the first core on the node (os_index 0, socket 0).
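
    (To illustrate the asymmetry that seems to trigger this, here is a small
    stand-alone hwloc sketch - not code taken from Open MPI - that counts how
    many cores of each socket fall inside the cpuset of the current
    allocation; on c1-8 it should report 2 cores on socket 0 and 3 cores on
    socket 1:)

    /*
     * Sketch only: count the cores of each socket that are inside this
     * allocation's cpuset. By default hwloc already restricts the topology
     * to the cgroup cpuset handed out by SLURM, so no extra filtering is
     * needed. (HWLOC_OBJ_SOCKET is called HWLOC_OBJ_PACKAGE in newer hwloc.)
     */
    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int nsockets = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET);
        for (int s = 0; s < nsockets; s++) {
            hwloc_obj_t sock = hwloc_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, s);
            int ncores = hwloc_get_nbobjs_inside_cpuset_by_type(topo, sock->cpuset,
                                                                HWLOC_OBJ_CORE);
            printf("socket %u: %d core(s) available\n", sock->os_index, ncores);
        }

        hwloc_topology_destroy(topo);
        return 0;
    }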

    I have submitted a few other jobs and I always got an error in
    such situations. Moreover, if I now use --map-by core instead of
    --map-by socket, the error is gone, and I get my expected binding:

    rank 0 @ compute-1-2.local  1, 17,
    rank 1 @ compute-1-2.local  2, 18,
    rank 2 @ compute-1-2.local  3, 19,
    rank 3 @ compute-1-2.local  4, 20,
    rank 4 @ compute-1-4.local  1, 17,
    rank 5 @ compute-1-4.local  15, 31,
    rank 6 @ compute-1-8.local  0, 16,
    rank 7 @ compute-1-8.local  5, 21,
    rank 8 @ compute-1-8.local  9, 25,
    rank 9 @ compute-1-8.local  13, 29,
    rank 10 @ compute-1-8.local  14, 30,
    rank 11 @ compute-1-13.local  3, 19,
    rank 12 @ compute-1-13.local  4, 20,
    rank 13 @ compute-1-13.local  5, 21,
    rank 14 @ compute-1-13.local  6, 22,
    rank 15 @ compute-1-13.local  7, 23,
    rank 16 @ compute-1-16.local  12, 28,
    rank 17 @ compute-1-16.local  13, 29,
    rank 18 @ compute-1-16.local  14, 30,
    rank 19 @ compute-1-16.local  15, 31,
    rank 20 @ compute-1-23.local  2, 18,
    rank 29 @ compute-1-26.local  11, 27,
    rank 21 @ compute-1-23.local  3, 19,
    rank 30 @ compute-1-26.local  13, 29,
    rank 22 @ compute-1-23.local  4, 20,
    rank 31 @ compute-1-26.local  15, 31,
    rank 23 @ compute-1-23.local  8, 24,
    rank 27 @ compute-1-26.local  1, 17,
    rank 24 @ compute-1-23.local  13, 29,
    rank 28 @ compute-1-26.local  6, 22,
    rank 25 @ compute-1-23.local  14, 30,
    rank 26 @ compute-1-23.local  15, 31,

    Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and
    1.10.1rc1. However, there is still a difference in behavior
    between 1.10.1rc1 and the earlier versions. In the SLURM job
    described in the last post, 1.10.1rc1 fails to bind in only 1
    case, while the earlier versions fail in 21 out of 32 cases. You
    mentioned there was a bug in hwloc; I am not sure whether it can
    explain the difference in behavior.

    Hope this helps to nail this down.

    Marcin




    On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:
    Ralph,

    I suspect ompi tries to bind to threads outside the cpuset.
    This could be pretty similar to a previous issue in which ompi
    tried to bind to cores outside the cpuset.
    /* when a core has more than one thread, does ompi assume all
    the threads are available if the core is available? */
    I will investigate this starting tomorrow.
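
    (One quick way to check that distinction outside of ompi is the following
    plain hwloc sketch - just an illustration, not the actual ompi code path -
    which reports, per core, whether all of its hardware threads or only some
    of them are inside the allowed cpuset:)

    /*
     * Sketch only: for each core, report whether all of its hardware threads
     * or only some of them are in the allowed cpuset. The WHOLE_SYSTEM flag
     * keeps disallowed PUs visible so the comparison is meaningful.
     */
    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM);
        hwloc_topology_load(topo);

        hwloc_const_cpuset_t allowed = hwloc_topology_get_allowed_cpuset(topo);

        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        for (int c = 0; c < ncores; c++) {
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, c);
            if (hwloc_bitmap_isincluded(core->cpuset, allowed))
                printf("core %u: all hwthreads allowed\n", core->os_index);
            else if (hwloc_bitmap_intersects(core->cpuset, allowed))
                printf("core %u: only some hwthreads allowed\n", core->os_index);
            /* cores with no allowed hwthread at all are skipped */
        }

        hwloc_topology_destroy(topo);
        return 0;
    }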

    Cheers,

    Gilles

    On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org> wrote:

        Thanks - please go ahead and release that allocation as I’m
        not going to get to this immediately. I’ve got several hot
        irons in the fire right now, and I’m not sure when I’ll get
        a chance to track this down.

        Gilles or anyone else who might have time - feel free to
        take a gander and see if something pops out at you.

        Ralph


        On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski
        <marcin.krotkiew...@gmail.com> wrote:


        Done. I have compiled 1.10.0 and 1.10.1rc1 with
        --enable-debug and executed

        mpirun --mca rmaps_base_verbose 10 --hetero-nodes
        --report-bindings --bind-to core -np 32 ./affinity

        In the case of 1.10.1rc1 I have also added :overload-allowed;
        that output is in a separate file. This option did not make
        much difference for 1.10.0, so I did not attach it here.

        The first thing I noticed for 1.10.0 is lines like

        [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
        [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
        [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 IS NOT BOUND

        with an empty BITMAP.

        The SLURM environment is

        set | grep SLURM
        SLURM_JOBID=12714491
        SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
        SLURM_JOB_ID=12714491
        SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
        SLURM_JOB_NUM_NODES=7
        SLURM_JOB_PARTITION=normal
        SLURM_MEM_PER_CPU=2048
        SLURM_NNODES=7
        SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
        SLURM_NODE_ALIASES='(null)'
        SLURM_NPROCS=32
        SLURM_NTASKS=32
        SLURM_SUBMIT_DIR=/cluster/home/marcink
        SLURM_SUBMIT_HOST=login-0-1.local
        SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'

        I have now submitted an interactive job in screen for 120
        hours, so that I can work with one example and not change it
        for every post :)

        If you need anything else, let me know. I could introduce
        some patch/printfs and recompile, if you need it.

        Marcin



        On 10/03/2015 07:17 PM, Ralph Castain wrote:
        Rats - just realized I have no way to test this as none
        of the machines I can access are set up for cgroup-based
        multi-tenancy. Is this a debug version of OMPI? If not,
        can you rebuild OMPI with --enable-debug?

        Then please run it with --mca rmaps_base_verbose 10 and
        pass along the output.

        Thanks
        Ralph


        On Oct 3, 2015, at 10:09 AM, Ralph Castain
        <r...@open-mpi.org> wrote:

        What version of slurm is this? I might try to debug it
        here. I’m not sure where the problem lies just yet.


        On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski
        <marcin.krotkiew...@gmail.com> wrote:

        Here is the output of lstopo. In short, (0,16) are core
        0, (1,17) are core 1, etc.

        Machine (64GB)
          NUMANode L#0 (P#0 32GB)
            Socket L#0 + L3 L#0 (20MB)
              L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
                PU L#0 (P#0)
                PU L#1 (P#16)
              L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
                PU L#2 (P#1)
                PU L#3 (P#17)
              L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
                PU L#4 (P#2)
                PU L#5 (P#18)
              L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
                PU L#6 (P#3)
                PU L#7 (P#19)
              L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
                PU L#8 (P#4)
                PU L#9 (P#20)
              L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
                PU L#10 (P#5)
                PU L#11 (P#21)
              L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
                PU L#12 (P#6)
                PU L#13 (P#22)
              L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
                PU L#14 (P#7)
                PU L#15 (P#23)
            HostBridge L#0
              PCIBridge
                PCI 8086:1521
                  Net L#0 "eth0"
                PCI 8086:1521
                  Net L#1 "eth1"
              PCIBridge
                PCI 15b3:1003
                  Net L#2 "ib0"
                  OpenFabrics L#3 "mlx4_0"
              PCIBridge
                PCI 102b:0532
              PCI 8086:1d02
                Block L#4 "sda"
          NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
            L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
              PU L#16 (P#8)
              PU L#17 (P#24)
            L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
              PU L#18 (P#9)
              PU L#19 (P#25)
            L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
              PU L#20 (P#10)
              PU L#21 (P#26)
            L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
              PU L#22 (P#11)
              PU L#23 (P#27)
            L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
              PU L#24 (P#12)
              PU L#25 (P#28)
            L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
              PU L#26 (P#13)
              PU L#27 (P#29)
            L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
              PU L#28 (P#14)
              PU L#29 (P#30)
            L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
              PU L#30 (P#15)
              PU L#31 (P#31)



        On 10/03/2015 05:46 PM, Ralph Castain wrote:
        Maybe I’m just misreading your HT map - that slurm
        nodelist syntax is a new one to me, but they tend to
        change things around. Could you run lstopo on one of
        those compute nodes and send the output?

        I’m just suspicious because I’m not seeing a clear
        pairing of HT numbers in your output, but HT numbering
        is BIOS-specific and I may just not be understanding
        your particular pattern. Our error message is clearly
        indicating that we are seeing individual HTs (and not
        complete cores) assigned, and I don’t know the source
        of that confusion.


        On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski
        <marcin.krotkiew...@gmail.com> wrote:


        On 10/03/2015 04:38 PM, Ralph Castain wrote:
        If mpirun isn’t trying to do any binding, then you
        will of course get the right mapping as we’ll just
        inherit whatever we received.
        Yes. I meant that whatever you received (what SLURM
        gives) is a correct cpu map and assigns _whole_ CPUs,
        not single HTs, to MPI processes. In the case
        mentioned earlier openmpi should start 6 tasks on
        c1-30. If HTs were treated as separate and
        independent cores, sched_getaffinity of an MPI
        process started on c1-30 would return a map with only
        6 entries. In my case it returns a map with 12
        entries - 2 for each core. So each process is in fact
        allocated both HTs, not only one. Is what I'm saying
        correct?

        Looking at your output, it’s pretty clear that you
        are getting independent HTs assigned and not full cores.
        How do you mean? Is the above understanding wrong? I
        would expect that on c1-30 with --bind-to core
        openmpi should bind to logical cores 0 and 16 (rank
        0), 1 and 17 (rank 1), and so on. All those logical
        cores are available in the sched_getaffinity map, and
        there are twice as many logical cores as there are MPI
        processes started on the node.

        My guess is that something in slurm has changed such
        that it detects that HT has been enabled, and then
        begins treating the HTs as completely independent cpus.

        Try changing “-bind-to core” to “-bind-to hwthread -use-hwthread-cpus”
        and see if that works

        I have, and the binding is wrong. For example, I got
        this output:

        rank 0 @ compute-1-30.local 0,
        rank 1 @ compute-1-30.local 16,

        which means that two ranks have been bound to the
        same physical core (logical cores 0 and 16 are two
        HTs of the same core). If I use --bind-to core, I get
        the following correct binding:

        rank 0 @ compute-1-30.local 0, 16,

        The problem is that many other ranks get a bad binding,
        with a 'rank XXX is not bound (or bound to all available
        processors)' warning.

        But I think I was not entirely correct in saying that
        1.10.1rc1 did not fix things. It might still have
        improved something, but not everything. Consider this
        job:

        SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
        SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'

        If I run 32 tasks as follows (with 1.10.1rc1)

        mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity

        I get the following error:

        --------------------------------------------------------------------------
        A request was made to bind to that would result in binding more
        processes than cpus on a resource:

           Bind to:     CORE
           Node:        c9-31
           #processes:  2
           #cpus:       1

        You can override this protection by adding the "overload-allowed"
        option to your binding directive.
        --------------------------------------------------------------------------


        If I now use --bind-to core:overload-allowed, then
        openmpi starts and _most_ of the ranks are bound
        correctly (i.e., the map contains two logical cores in
        ALL cases), except in this one case, which required the
        overload flag:

        rank 15 @ compute-9-31.local 1, 17,
        rank 16 @ compute-9-31.local 11, 27,
        rank 17 @ compute-9-31.local 2, 18,
        rank 18 @ compute-9-31.local 12, 28,
        rank 19 @ compute-9-31.local 1, 17,

        Note that the pair (1,17) is used twice. The original
        SLURM-delivered map (no binding) on this node is:

        rank 15 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
        rank 16 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
        rank 17 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
        rank 18 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
        rank 19 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,

        Why does openmpi use core (1,17) twice instead of
        using core (13,29)? Clearly, the original
        SLURM-delivered map includes 5 physical cores, enough
        for 5 MPI processes.

        Cheers,

        Marcin



        On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski
        <marcin.krotkiew...@gmail.com> wrote:


        On 10/03/2015 01:06 PM, Ralph Castain wrote:
        Thanks Marcin. Looking at this, I’m guessing that
        Slurm may be treating HTs as “cores” - i.e., as
        independent cpus. Any chance that is true?
        Not to the best of my knowledge, and at least not
        intentionally. SLURM starts as many processes as
        there are physical cores, not threads. To verify
        this, consider this test case:



_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/10/27814.php
