Dear Ralph, Gilles, and Jeff,

Thanks a lot for your effort. Understanding this problem has been a very interesting exercise that has helped me understand Open MPI much better (I think :).

I have given it all a little more thought and done some more tests on our production system, and I think this is not exactly a corner case. First of all, I suspect all of this holds for other job scheduling systems besides SLURM (still to be thought through). Moreover, on our system a rather common usage scenario involves a SLURM job allocation using, e.g.,

salloc --ntasks=32

which results in very fragmented allocations - that's specific to the type of problems users run on this cluster, but it's a fact. Users then run the job using

mpirun ./program

For versions up to 1.10.0, with uneven resource allocation among compute nodes the default binding options used by Open MPI in most cases result in some CPU cores not being present in the used cpuset at all, while others are over- or under-subscribed. This is certainly job-specific and depends on how fragmented the SLURM allocations are, but to give a scary number: in one case I started 512 tasks (1 per core), and Open MPI's binding created a cpuset that used only 271 cores, some of them over- or under-subscribed on top of that. Effectively, the user gets roughly half of what was asked for. As already discussed, this happens quietly - the user has no idea.
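
A quick way to actually see this is to compare the cpuset that SLURM grants - e.g., with the command Gilles suggested earlier in this thread, run inside the allocation:

srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep Cpus_allowed_list /proc/self/status

against the bindings printed when --report-bindings is added to the mpirun command line.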

For version 1.10.1rc1 and up the situation is a bit different: in many cases all cores are present in the cpuset, but the binding itself often does not take place. Instead, processes are bound to all cores allocated by SLURM. In other scenarios, as discussed before, some cores are over- or under-subscribed. Again, this happens quietly.

In all cases what is needed is the --hetero-nodes switch. With the patch that Gilles posted applied, it seems to be enough for 1.10.1rc1 and up. The switch is not enough for earlier versions of Open MPI, where --map-by core is needed in addition.
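
In practice, the workaround for users on our system would therefore be something along the lines of

mpirun --hetero-nodes --map-by core ./program

with --map-by core added for the versions where --hetero-nodes alone is not enough.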

Given all that, I think some sort of fix would be in order soon. I agree with Ralph that a simplified fix would be a good choice for addressing this issue quickly. As Ralph has already pointed out (or at least as I understood it :), this would essentially involve activating --hetero-nodes by default and using --map-by core in cases where the architecture is not homogeneous. Un-suppressing the warning so that a failure to bind is not silent is the last piece of the puzzle. Maybe adding a sanity check to make sure all allocated resources are actually in use would be helpful - if not by default, then maybe behind some flag.
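
A crude manual version of such a check - just an illustration of the idea, not a proposed implementation - could be as simple as

mpirun --hetero-nodes --map-by core bash -c 'grep Cpus_allowed_list /proc/self/status' | sort | uniq -c

which at least makes duplicated bindings immediately visible; comparing the listed cpus against what SLURM granted would also catch cores that are never used.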

Does all this make sense?

Again, thank you all for your help,

Marcin





On 10/07/2015 04:03 PM, Ralph Castain wrote:
I’m a little nervous about this one, Gilles. It’s doing a lot more than just addressing the immediate issue, and I’m concerned about any potential side-effects that we don’t fully uncover prior to release.

I’d suggest a two-pronged approach:

1. use my alternative method for 1.10.1 to solve the immediate issue. It only affects this one, rather unusual, corner-case that was reported here. So the impact can be easily contained and won’t impact anything else.

2. push your proposed solution to the master where it can soak for a while and give us a chance to fully discover the secondary effects. Removing the unused and “not-allowed” cpus from the topology means a substantial scrub of the code base in a number of places, and your patch doesn’t really get them all. It’s going to take time to ensure everything is working correctly again.

HTH
Ralph

On Oct 7, 2015, at 4:29 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Jeff,

there are quite a lot of changes, and I did not update master yet (this needs extra pairs of eyes to review...), so unless you want to make rc2 today and rc3 a week later, it is imho way safer to wait for v1.10.2

Ralph,
any thoughts?

Cheers,

Gilles

On Wednesday, October 7, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

    Is this something that needs to go into v1.10.1?

    If so, a PR needs to be filed ASAP.  We were supposed to make the
    next 1.10.1 RC yesterday, but slipped to today due to some last
    second patches.


    > On Oct 7, 2015, at 4:32 AM, Gilles Gouaillardet
    <gil...@rist.or.jp> wrote:
    >
    > Marcin,
    >
    > here is a patch for the master; hopefully it fixes all the
    issues we discussed
    > I will make sure it applies fine against the latest 1.10 tarball
    from tomorrow
    >
    > Cheers,
    >
    > Gilles
    >
    >
    > On 10/6/2015 7:22 PM, marcin.krotkiewski wrote:
    >> Gilles,
    >>
    >> Yes, it seemed that all was fine with binding in the patched
    1.10.1rc1 - thank you. I am eagerly waiting for the other patches;
    let me know and I will test them later this week.
    >>
    >> Marcin
    >>
    >>
    >>
    >> On 10/06/2015 12:09 PM, Gilles Gouaillardet wrote:
    >>> Marcin,
    >>>
    >>> my understanding is that in this case, patched v1.10.1rc1 is
    working just fine.
    >>> Am I right?
    >>>
    >>> I prepared two patches
    >>> one to remove the warning when binding to one core if only
    one core is available,
    >>> another one to add a warning if the user asks for a binding
    policy that makes no sense with the required mapping policy
    >>>
    >>> I will finalize them tomorrow hopefully
    >>>
    >>> Cheers,
    >>>
    >>> Gilles
    >>>
    >>> On Tuesday, October 6, 2015, marcin.krotkiewski
    <marcin.krotkiew...@gmail.com> wrote:
    >>> Hi, Gilles
    >>>> you mentioned you had one failure with 1.10.1rc1 and
    -bind-to core
    >>>> could you please send the full details (script, allocation
    and output)
    >>>> in your slurm script, you can do
    >>>> srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l
    grep Cpus_allowed_list /proc/self/status
    >>>> before invoking mpirun
    >>>>
    >>> It was an interactive job allocated with
    >>>
    >>> salloc --account=staff --ntasks=32 --mem-per-cpu=2G
    --time=120:0:0
    >>>
    >>> The slurm environment is the following
    >>>
    >>> SLURM_JOBID=12714491
    >>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
    >>> SLURM_JOB_ID=12714491
    >>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
    >>> SLURM_JOB_NUM_NODES=7
    >>> SLURM_JOB_PARTITION=normal
    >>> SLURM_MEM_PER_CPU=2048
    >>> SLURM_NNODES=7
    >>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
    >>> SLURM_NODE_ALIASES='(null)'
    >>> SLURM_NPROCS=32
    >>> SLURM_NTASKS=32
    >>> SLURM_SUBMIT_DIR=/cluster/home/marcink
    >>> SLURM_SUBMIT_HOST=login-0-1.local
    >>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
    >>>
    >>> The output of the command you asked for is
    >>>
    >>> 0: c1-2.local  Cpus_allowed_list: 1-4,17-20
    >>> 1: c1-4.local  Cpus_allowed_list: 1,15,17,31
    >>> 2: c1-8.local  Cpus_allowed_list: 0,5,9,13-14,16,21,25,29-30
    >>> 3: c1-13.local  Cpus_allowed_list:  3-7,19-23
    >>> 4: c1-16.local  Cpus_allowed_list:  12-15,28-31
    >>> 5: c1-23.local  Cpus_allowed_list:  2-4,8,13-15,18-20,24,29-31
    >>> 6: c1-26.local  Cpus_allowed_list:  1,6,11,13,15,17,22,27,29,31
    >>>
    >>> Running with command
    >>>
    >>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --bind-to
    core --report-bindings --map-by socket -np 32 ./affinity
    >>>
    >>> I have attached two output files: one for the original
    1.10.1rc1, one for the patched version.
    >>>
    >>> When I said 'failed in one case' I was not precise. I got an
    error on node c1-8, which was the first one to have a different
    number of MPI processes on the two sockets. It would also fail on
    some later nodes; we just never got there because of the error.
    >>>
    >>> Let me know if you need more.
    >>>
    >>> Marcin
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>> Cheers,
    >>>>
    >>>> Gilles
    >>>>
    >>>> On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:
    >>>>> Hi, all,
    >>>>>
    >>>>> I played a bit more and it seems that the problem results from
    >>>>>
    >>>>> trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()
    >>>>>
    >>>>> called in rmaps_base_binding.c / bind_downwards being
    wrong. I do not know the reason, but I think I know when the
    problem happens (at least on 1.10.1rc1). It seems that by default
    openmpi maps by socket. The error happens when, for a given
    compute node, a different number of cores is used on each socket.
    Consider the previously studied case (the debug outputs I sent in
    the last post). c1-8, which was the source of the error, has 5 MPI
    processes assigned, and the cpuset is the following:
    >>>>>
    >>>>> 0, 5, 9, 13, 14, 16, 21, 25, 29, 30
    >>>>>
    >>>>> Cores 0,5 are on socket 0, cores 9, 13, 14 are on socket 1.
    Binding progresses correctly up to and including core 13 (see end
    of file out.1.10.1rc2, before the error). That is 2 cores on
    socket 0, and 2 cores on socket 1. The error is thrown when core 14
    should be bound - an extra core on socket 1 with no corresponding
    core on socket 0. At that point the returned trg_obj points to
    the first core on the node (os_index 0, socket 0).
    >>>>>
    >>>>> I have submitted a few other jobs and I always had an error
    in such situations. Moreover, if I now use --map-by core instead
    of socket, the error is gone, and I get my expected binding:
    >>>>>
    >>>>> rank 0 @ compute-1-2.local  1, 17,
    >>>>> rank 1 @ compute-1-2.local  2, 18,
    >>>>> rank 2 @ compute-1-2.local  3, 19,
    >>>>> rank 3 @ compute-1-2.local  4, 20,
    >>>>> rank 4 @ compute-1-4.local  1, 17,
    >>>>> rank 5 @ compute-1-4.local  15, 31,
    >>>>> rank 6 @ compute-1-8.local  0, 16,
    >>>>> rank 7 @ compute-1-8.local  5, 21,
    >>>>> rank 8 @ compute-1-8.local  9, 25,
    >>>>> rank 9 @ compute-1-8.local  13, 29,
    >>>>> rank 10 @ compute-1-8.local  14, 30,
    >>>>> rank 11 @ compute-1-13.local  3, 19,
    >>>>> rank 12 @ compute-1-13.local  4, 20,
    >>>>> rank 13 @ compute-1-13.local  5, 21,
    >>>>> rank 14 @ compute-1-13.local  6, 22,
    >>>>> rank 15 @ compute-1-13.local  7, 23,
    >>>>> rank 16 @ compute-1-16.local  12, 28,
    >>>>> rank 17 @ compute-1-16.local  13, 29,
    >>>>> rank 18 @ compute-1-16.local  14, 30,
    >>>>> rank 19 @ compute-1-16.local  15, 31,
    >>>>> rank 20 @ compute-1-23.local  2, 18,
    >>>>> rank 29 @ compute-1-26.local  11, 27,
    >>>>> rank 21 @ compute-1-23.local  3, 19,
    >>>>> rank 30 @ compute-1-26.local  13, 29,
    >>>>> rank 22 @ compute-1-23.local  4, 20,
    >>>>> rank 31 @ compute-1-26.local  15, 31,
    >>>>> rank 23 @ compute-1-23.local  8, 24,
    >>>>> rank 27 @ compute-1-26.local  1, 17,
    >>>>> rank 24 @ compute-1-23.local  13, 29,
    >>>>> rank 28 @ compute-1-26.local  6, 22,
    >>>>> rank 25 @ compute-1-23.local  14, 30,
    >>>>> rank 26 @ compute-1-23.local  15, 31,
    >>>>>
    >>>>> Using --map-by core seems to fix the issue on 1.8.8, 1.10.0
    and 1.10.1rc1. However, there is still a difference in behavior
    between 1.10.1rc1 and earlier versions. In the SLURM job
    described in the last post, 1.10.1rc1 fails to bind in only 1 case,
    while the earlier versions fail in 21 out of 32 cases. You
    mentioned there was a bug in hwloc; I am not sure whether it can
    explain the difference in behavior.
    >>>>>
    >>>>> Hope this helps to nail this down.
    >>>>>
    >>>>> Marcin
    >>>>>
    >>>>>
    >>>>>
    >>>>>
    >>>>> On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:
    >>>>>> Ralph,
    >>>>>>
    >>>>>> I suspect ompi tries to bind to threads outside the cpuset.
    >>>>>> This could be pretty similar to a previous issue when ompi
    tried to bind to cores outside the cpuset.
    >>>>>> /* when a core has more than one thread, would ompi assume
    all the threads are available if the core is available ? */
    >>>>>> I will investigate this from tomorrow
    >>>>>>
    >>>>>> Cheers,
    >>>>>>
    >>>>>> Gilles
    >>>>>>
    >>>>>> On Sunday, October 4, 2015, Ralph Castain
    <r...@open-mpi.org> wrote:
    >>>>>> Thanks - please go ahead and release that allocation as
    I’m not going to get to this immediately. I’ve got several hot
    irons in the fire right now, and I’m not sure when I’ll get a
    chance to track this down.
    >>>>>>
    >>>>>> Gilles or anyone else who might have time - feel free to
    take a gander and see if something pops out at you.
    >>>>>>
    >>>>>> Ralph
    >>>>>>
    >>>>>>
    >>>>>>> On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski
    <marcin.krotkiew...@gmail.com> wrote:
    >>>>>>>
    >>>>>>>
    >>>>>>> Done. I have compiled 1.10.0 and 1.10.1rc1 with
    --enable-debug and executed
    >>>>>>>
    >>>>>>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes
    --report-bindings --bind-to core -np 32 ./affinity
    >>>>>>>
    >>>>>>> In the case of 1.10.1rc1 I have also added :overload-allowed -
    the output is in a separate file. This option did not make much
    difference for 1.10.0, so I did not attach it here.
    >>>>>>>
    >>>>>>> The first thing I noticed for 1.10.0 is lines like
    >>>>>>>
    >>>>>>> [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
    >>>>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27]
    BITMAP
    >>>>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27]
    ON c1-26 IS NOT BOUND
    >>>>>>>
    >>>>>>> with an empty BITMAP.
    >>>>>>>
    >>>>>>> The SLURM environment is
    >>>>>>>
    >>>>>>> set | grep SLURM
    >>>>>>> SLURM_JOBID=12714491
    >>>>>>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
    >>>>>>> SLURM_JOB_ID=12714491
    >>>>>>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
    >>>>>>> SLURM_JOB_NUM_NODES=7
    >>>>>>> SLURM_JOB_PARTITION=normal
    >>>>>>> SLURM_MEM_PER_CPU=2048
    >>>>>>> SLURM_NNODES=7
    >>>>>>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
    >>>>>>> SLURM_NODE_ALIASES='(null)'
    >>>>>>> SLURM_NPROCS=32
    >>>>>>> SLURM_NTASKS=32
    >>>>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink
    >>>>>>> SLURM_SUBMIT_HOST=login-0-1.local
    >>>>>>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
    >>>>>>>
    >>>>>>> I have now submitted an interactive job in screen for 120
    hours, so that I can work with one example and not change it for every
    post :)
    >>>>>>>
    >>>>>>> If you need anything else, let me know. I could introduce
    some patch/printfs and recompile, if you need it.
    >>>>>>>
    >>>>>>> Marcin
    >>>>>>>
    >>>>>>>
    >>>>>>>
    >>>>>>> On 10/03/2015 07:17 PM, Ralph Castain wrote:
    >>>>>>>> Rats - just realized I have no way to test this as none
    of the machines I can access are set up for cgroup-based
    multi-tenancy. Is this a debug version of OMPI? If not, can you
    rebuild OMPI with --enable-debug?
    >>>>>>>>
    >>>>>>>> Then please run it with --mca rmaps_base_verbose 10 and
    pass along the output.
    >>>>>>>>
    >>>>>>>> Thanks
    >>>>>>>> Ralph
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>> On Oct 3, 2015, at 10:09 AM, Ralph Castain
    <r...@open-mpi.org> wrote:
    >>>>>>>>>
    >>>>>>>>> What version of slurm is this? I might try to debug it
    here. I’m not sure where the problem lies just yet.
    >>>>>>>>>
    >>>>>>>>>
    >>>>>>>>>> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski
    <marcin.krotkiew...@gmail.com> wrote:
    >>>>>>>>>>
    >>>>>>>>>> Here is the output of lstopo. In short, (0,16) is
    core 0, (1,17) is core 1, etc.
    >>>>>>>>>>
    >>>>>>>>>> Machine (64GB)
    >>>>>>>>>>   NUMANode L#0 (P#0 32GB)
    >>>>>>>>>>     Socket L#0 + L3 L#0 (20MB)
    >>>>>>>>>>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB)
    + Core L#0
    >>>>>>>>>>         PU L#0 (P#0)
    >>>>>>>>>>         PU L#1 (P#16)
    >>>>>>>>>>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB)
    + Core L#1
    >>>>>>>>>>         PU L#2 (P#1)
    >>>>>>>>>>         PU L#3 (P#17)
    >>>>>>>>>>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB)
    + Core L#2
    >>>>>>>>>>         PU L#4 (P#2)
    >>>>>>>>>>         PU L#5 (P#18)
    >>>>>>>>>>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB)
    + Core L#3
    >>>>>>>>>>         PU L#6 (P#3)
    >>>>>>>>>>         PU L#7 (P#19)
    >>>>>>>>>>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB)
    + Core L#4
    >>>>>>>>>>         PU L#8 (P#4)
    >>>>>>>>>>         PU L#9 (P#20)
    >>>>>>>>>>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB)
    + Core L#5
    >>>>>>>>>>         PU L#10 (P#5)
    >>>>>>>>>>         PU L#11 (P#21)
    >>>>>>>>>>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB)
    + Core L#6
    >>>>>>>>>>         PU L#12 (P#6)
    >>>>>>>>>>         PU L#13 (P#22)
    >>>>>>>>>>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB)
    + Core L#7
    >>>>>>>>>>         PU L#14 (P#7)
    >>>>>>>>>>         PU L#15 (P#23)
    >>>>>>>>>>  HostBridge L#0
    >>>>>>>>>>  PCIBridge
    >>>>>>>>>>         PCI 8086:1521
    >>>>>>>>>>           Net L#0 "eth0"
    >>>>>>>>>>         PCI 8086:1521
    >>>>>>>>>>           Net L#1 "eth1"
    >>>>>>>>>>  PCIBridge
    >>>>>>>>>>         PCI 15b3:1003
    >>>>>>>>>>           Net L#2 "ib0"
    >>>>>>>>>>  OpenFabrics L#3 "mlx4_0"
    >>>>>>>>>>  PCIBridge
    >>>>>>>>>>         PCI 102b:0532
    >>>>>>>>>>       PCI 8086:1d02
    >>>>>>>>>>         Block L#4 "sda"
    >>>>>>>>>>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
    >>>>>>>>>>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) +
    Core L#8
    >>>>>>>>>>       PU L#16 (P#8)
    >>>>>>>>>>       PU L#17 (P#24)
    >>>>>>>>>>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) +
    Core L#9
    >>>>>>>>>>       PU L#18 (P#9)
    >>>>>>>>>>       PU L#19 (P#25)
    >>>>>>>>>>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10
    (32KB) + Core L#10
    >>>>>>>>>>       PU L#20 (P#10)
    >>>>>>>>>>       PU L#21 (P#26)
    >>>>>>>>>>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11
    (32KB) + Core L#11
    >>>>>>>>>>       PU L#22 (P#11)
    >>>>>>>>>>       PU L#23 (P#27)
    >>>>>>>>>>     L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12
    (32KB) + Core L#12
    >>>>>>>>>>       PU L#24 (P#12)
    >>>>>>>>>>       PU L#25 (P#28)
    >>>>>>>>>>     L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13
    (32KB) + Core L#13
    >>>>>>>>>>       PU L#26 (P#13)
    >>>>>>>>>>       PU L#27 (P#29)
    >>>>>>>>>>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14
    (32KB) + Core L#14
    >>>>>>>>>>       PU L#28 (P#14)
    >>>>>>>>>>       PU L#29 (P#30)
    >>>>>>>>>>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15
    (32KB) + Core L#15
    >>>>>>>>>>       PU L#30 (P#15)
    >>>>>>>>>>       PU L#31 (P#31)
    >>>>>>>>>>
    >>>>>>>>>>
    >>>>>>>>>>
    >>>>>>>>>> On 10/03/2015 05:46 PM, Ralph Castain wrote:
    >>>>>>>>>>> Maybe I’m just misreading your HT map - that slurm
    nodelist syntax is a new one to me, but they tend to change
    things around. Could you run lstopo on one of those compute nodes
    and send the output?
    >>>>>>>>>>>
    >>>>>>>>>>> I’m just suspicious because I’m not seeing a clear
    pairing of HT numbers in your output, but HT numbering is
    BIOS-specific and I may just not be understanding your particular
    pattern. Our error message is clearly indicating that we are
    seeing individual HTs (and not complete cores) assigned, and I
    don’t know the source of that confusion.
    >>>>>>>>>>>
    >>>>>>>>>>>
    >>>>>>>>>>>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski
    <marcin.krotkiew...@gmail.com> wrote:
    >>>>>>>>>>>>
    >>>>>>>>>>>>
    >>>>>>>>>>>> On 10/03/2015 04:38 PM, Ralph Castain wrote:
    >>>>>>>>>>>>> If mpirun isn’t trying to do any binding, then you
    will of course get the right mapping as we’ll just inherit
    whatever we received.
    >>>>>>>>>>>> Yes. I meant that whatever you received (what SLURM
    gives) is a correct cpu map and assigns _whole_ CPUs, not single
    HTs, to MPI processes. In the case mentioned earlier openmpi
    should start 6 tasks on c1-30. If HTs were treated as separate
    and independent cores, sched_getaffinity of an MPI process
    started on c1-30 would return a map with only 6 entries. In my
    case it returns a map with 12 entries - 2 for each core. So one
    process is in fact allocated both HTs, not only one. Is what I'm
    saying correct?
    >>>>>>>>>>>>
    >>>>>>>>>>>>> Looking at your output, it’s pretty clear that you
    are getting independent HTs assigned and not full cores.
    >>>>>>>>>>>> How do you mean? Is the above understanding wrong? I
    would expect that on c1-30 with --bind-to core openmpi should
    bind to logical cores 0 and 16 (rank 0), 1 and 17 (rank 2) and so
    on. All those logical cores are available in sched_getaffinity
    map, and there is twice as many logical cores as there are MPI
    processes started on the node.
    >>>>>>>>>>>>
    >>>>>>>>>>>>> My guess is that something in slurm has changed
    such that it detects that HT has been enabled, and then begins
    treating the HTs as completely independent cpus.
    >>>>>>>>>>>>>
    >>>>>>>>>>>>> Try changing “-bind-to core” to “-bind-to hwthread
    -use-hwthread-cpus” and see if that works
    >>>>>>>>>>>>>
    >>>>>>>>>>>> I have, and the binding is wrong. For example, I got
    this output
    >>>>>>>>>>>>
    >>>>>>>>>>>> rank 0 @ compute-1-30.local  0,
    >>>>>>>>>>>> rank 1 @ compute-1-30.local  16,
    >>>>>>>>>>>>
    >>>>>>>>>>>> Which means that two ranks have been bound to the
    same physical core (logical cores 0 and 16 are two HTs of the
    same core). If I use --bind-to core, I get the following correct
    binding
    >>>>>>>>>>>>
    >>>>>>>>>>>> rank 0 @ compute-1-30.local  0, 16,
    >>>>>>>>>>>>
    >>>>>>>>>>>> The problem is that many other ranks get a bad binding,
    with a 'rank XXX is not bound (or bound to all available processors)'
    warning.
    >>>>>>>>>>>>
    >>>>>>>>>>>> But I think I was not entirely correct saying that
    1.10.1rc1 did not fix things. It still might have improved
    something, but not everything. Consider this job:
    >>>>>>>>>>>>
    >>>>>>>>>>>> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
    >>>>>>>>>>>>
    SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
    >>>>>>>>>>>>
    >>>>>>>>>>>> If I run 32 tasks as follows (with 1.10.1rc1)
    >>>>>>>>>>>>
    >>>>>>>>>>>> mpirun --hetero-nodes --report-bindings --bind-to
    core -np 32 ./affinity
    >>>>>>>>>>>>
    >>>>>>>>>>>> I get the following error:
    >>>>>>>>>>>>
    >>>>>>>>>>>>
    --------------------------------------------------------------------------
    >>>>>>>>>>>> A request was made to bind to that would result in
    binding more
    >>>>>>>>>>>> processes than cpus on a resource:
    >>>>>>>>>>>>
    >>>>>>>>>>>> Bind to:     CORE
    >>>>>>>>>>>> Node:        c9-31
    >>>>>>>>>>>> #processes:  2
    >>>>>>>>>>>> #cpus:       1
    >>>>>>>>>>>>
    >>>>>>>>>>>> You can override this protection by adding the
    "overload-allowed"
    >>>>>>>>>>>> option to your binding directive.
    >>>>>>>>>>>>
    --------------------------------------------------------------------------
    >>>>>>>>>>>>
    >>>>>>>>>>>>
    >>>>>>>>>>>> If I now use --bind-to core:overload-allowed, then
    openmpi starts and _most_ of the threads are bound correctly
    (i.e., the map contains two logical cores in ALL cases), except for
    this case, which required the overload flag:
    >>>>>>>>>>>>
    >>>>>>>>>>>> rank 15 @ compute-9-31.local   1, 17,
    >>>>>>>>>>>> rank 16 @ compute-9-31.local  11, 27,
    >>>>>>>>>>>> rank 17 @ compute-9-31.local   2, 18,
    >>>>>>>>>>>> rank 18 @ compute-9-31.local  12, 28,
    >>>>>>>>>>>> rank 19 @ compute-9-31.local   1, 17,
    >>>>>>>>>>>>
    >>>>>>>>>>>> Note that the pair (1,17) is used twice. The original
    SLURM-delivered map (no binding) on this node is
    >>>>>>>>>>>>
    >>>>>>>>>>>> rank 15 @ compute-9-31.local  1, 2, 11, 12, 13, 17,
    18, 27, 28, 29,
    >>>>>>>>>>>> rank 16 @ compute-9-31.local  1, 2, 11, 12, 13, 17,
    18, 27, 28, 29,
    >>>>>>>>>>>> rank 17 @ compute-9-31.local  1, 2, 11, 12, 13, 17,
    18, 27, 28, 29,
    >>>>>>>>>>>> rank 18 @ compute-9-31.local  1, 2, 11, 12, 13, 17,
    18, 27, 28, 29,
    >>>>>>>>>>>> rank 19 @ compute-9-31.local  1, 2, 11, 12, 13, 17,
    18, 27, 28, 29,
    >>>>>>>>>>>>
    >>>>>>>>>>>> Why does openmpi use cores (1,17) twice instead of
    using core (13,29)? Clearly, the original SLURM-delivered map has
    5 CPUs included, enough for 5 MPI processes.
    >>>>>>>>>>>>
    >>>>>>>>>>>> Cheers,
    >>>>>>>>>>>>
    >>>>>>>>>>>> Marcin
    >>>>>>>>>>>>
    >>>>>>>>>>>>
    >>>>>>>>>>>>>
    >>>>>>>>>>>>>> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski
    <marcin.krotkiew...@gmail.com> wrote:
    >>>>>>>>>>>>>>
    >>>>>>>>>>>>>>
    >>>>>>>>>>>>>> On 10/03/2015 01:06 PM, Ralph Castain wrote:
    >>>>>>>>>>>>>>> Thanks Marcin. Looking at this, I’m guessing that
    Slurm may be treating HTs as “cores” - i.e., as independent cpus.
    Any chance that is true?
    >>>>>>>>>>>>>> Not to the best of my knowledge, and at least not
    intentionally. SLURM starts as many processes as there are
    physical cores, not threads. To verify this, consider this test case:
    >>>>>>
    >>>>>>
    >>>>>
    >>>>>
    >>>>>
    >>>>
    >>>>
    >>>>
    >>>
    >>>
    >>>
    >>
    >>
    >>
    >
    >
    
    > <heterogeneous_topologies.patch>


    --
    Jeff Squyres
    jsquy...@cisco.com
    For corporate legal information go to:
    http://www.cisco.com/web/about/doing_business/legal/cri/





_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2015/10/27834.php
