Machine (64GB)
  NUMANode L#0 (P#0 32GB)
    Socket L#0 + L3 L#0 (20MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#20)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#21)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#22)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#23)
    HostBridge L#0
      PCIBridge
        PCI 8086:1521
          Net L#0 "eth0"
        PCI 8086:1521
          Net L#1 "eth1"
      PCIBridge
        PCI 15b3:1003
          Net L#2 "ib0"
          OpenFabrics L#3 "mlx4_0"
      PCIBridge
        PCI 102b:0532
      PCI 8086:1d02
        Block L#4 "sda"
  NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
      PU L#16 (P#8)
      PU L#17 (P#24)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
      PU L#18 (P#9)
      PU L#19 (P#25)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
      PU L#20 (P#10)
      PU L#21 (P#26)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#27)
    L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
      PU L#24 (P#12)
      PU L#25 (P#28)
    L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
      PU L#26 (P#13)
      PU L#27 (P#29)
    L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
      PU L#28 (P#14)
      PU L#29 (P#30)
    L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
      PU L#30 (P#15)
      PU L#31 (P#31)
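
For reference, the core-to-PU pairing visible in this lstopo output (Core L#0 carries PU P#0 and PU P#16, so logical cpus 0 and 16 are the two hardware threads of the same core) can also be listed programmatically with the hwloc C API that Open MPI uses internally. A minimal sketch, not part of the original exchange, built with "cc topo.c -lhwloc":

/* Sketch: list each core and the OS indices of its PUs (hardware threads),
 * matching the Core/PU pairing shown by lstopo above. */
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (int i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        char *pus;
        /* list format, e.g. "0,16" for Core L#0 on the machine above */
        hwloc_bitmap_list_asprintf(&pus, core->cpuset);
        printf("Core L#%u: PUs %s\n", core->logical_index, pus);
        free(pus);
    }

    hwloc_topology_destroy(topo);
    return 0;
}

Each core's cpuset lists exactly which OS cpu numbers share that core, which is the pairing the binding discussion below relies on.
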
On 10/03/2015 05:46 PM, Ralph Castain wrote:
Maybe I’m just misreading your HT map - that slurm nodelist syntax is a new one to me, but they tend to change things around. Could you run lstopo on one of those compute nodes and send the output? I’m just suspicious because I’m not seeing a clear pairing of HT numbers in your output, but HT numbering is BIOS-specific and I may just not be understanding your particular pattern. Our error message is clearly indicating that we are seeing individual HTs (and not complete cores) assigned, and I don’t know the source of that confusion.

On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:

On 10/03/2015 04:38 PM, Ralph Castain wrote:

If mpirun isn’t trying to do any binding, then you will of course get the right mapping as we’ll just inherit whatever we received.

Yes. I meant that whatever you received (what SLURM gives) is a correct cpu map and assigns _whole_ CPUs, not a single HT, to MPI processes. In the case mentioned earlier openmpi should start 6 tasks on c1-30. If HTs were treated as separate and independent cores, sched_getaffinity of an MPI process started on c1-30 would return a map with 6 entries only. In my case it returns a map with 12 entries - 2 for each core. So one process is in fact allocated both HTs, not only one. Is what I'm saying correct?

Looking at your output, it’s pretty clear that you are getting independent HTs assigned and not full cores.

How do you mean? Is the above understanding wrong? I would expect that on c1-30 with --bind-to core openmpi should bind to logical cores 0 and 16 (rank 0), 1 and 17 (rank 1), and so on. All those logical cores are available in the sched_getaffinity map, and there are twice as many logical cores as there are MPI processes started on the node.

My guess is that something in slurm has changed such that it detects that HT has been enabled, and then begins treating the HTs as completely independent cpus. Try changing “-bind-to core” to “-bind-to hwthread -use-hwthread-cpus” and see if that works.

I have, and the binding is wrong. For example, I got this output

rank 0 @ compute-1-30.local 0,
rank 1 @ compute-1-30.local 16,

which means that two ranks have been bound to the same physical core (logical cores 0 and 16 are two HTs of the same core). If I use --bind-to core, I get the following correct binding

rank 0 @ compute-1-30.local 0, 16,

The problem is that many other ranks get a bad binding, with a 'rank XXX is not bound (or bound to all available processors)' warning.

But I think I was not entirely correct saying that 1.10.1rc1 did not fix things. It still might have improved something, but not everything. Consider this job:

SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'

If I run 32 tasks as follows (with 1.10.1rc1)

mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity

I get the following error:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        c9-31
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
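
As a side note on reading these values: SLURM_JOB_CPUS_PER_NODE and SLURM_TASKS_PER_NODE use a compressed repeat syntax, so '5(x2)' stands for two consecutive nodes with 5 cpus each, in SLURM_JOB_NODELIST order. A minimal sketch of expanding the list (assuming the standard SLURM format; not part of the original exchange):

/* Sketch: expand the compressed SLURM_JOB_CPUS_PER_NODE syntax,
 * e.g. "5,4,6,5(x2),7" -> "5 4 6 5 5 7", one entry per node. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *spec = getenv("SLURM_JOB_CPUS_PER_NODE");
    if (!spec)
        return 1;

    char copy[4096];
    snprintf(copy, sizeof(copy), "%s", spec);

    for (char *tok = strtok(copy, ","); tok; tok = strtok(NULL, ",")) {
        int cpus = 0, repeat = 1;
        if (sscanf(tok, "%d(x%d)", &cpus, &repeat) < 1)
            continue;                       /* skip anything unexpected */
        for (int i = 0; i < repeat; i++)
            printf("%d ", cpus);            /* cpus allocated on this node */
    }
    printf("\n");
    return 0;
}

Run inside the allocation, this prints one cpu count per node, which is what the per-node rank placement reported by mpirun can be checked against.
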
If I now use --bind-to core:overload-allowed, then openmpi starts and _most_ of the threads are bound correctly (i.e., the map contains two logical cores in ALL cases), except this case that required the overload flag:

rank 15 @ compute-9-31.local 1, 17,
rank 16 @ compute-9-31.local 11, 27,
rank 17 @ compute-9-31.local 2, 18,
rank 18 @ compute-9-31.local 12, 28,
rank 19 @ compute-9-31.local 1, 17,

Note the pair (1,17) is used twice. The original SLURM-delivered map (no binding) on this node is

rank 15 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 16 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 17 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 18 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
rank 19 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,

Why does openmpi use core (1,17) twice instead of using core (13,29)? Clearly, the original SLURM-delivered map has 5 cores included, enough for 5 MPI processes.

Cheers,
Marcin

On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:

On 10/03/2015 01:06 PM, Ralph Castain wrote:

Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating HTs as “cores” - i.e., as independent cpus. Any chance that is true?

Not to the best of my knowledge, and at least not intentionally. SLURM starts as many processes as there are physical cores, not threads. To verify this, consider this test case:

SLURM_JOB_CPUS_PER_NODE='6,8(x2),10'
SLURM_JOB_NODELIST='c1-[30-31],c2-[32,34]'

If I now execute only one mpi process WITH NO BINDING, it will go onto c1-30 and should have a map with 6 CPUs (12 hw threads). I run

mpirun --bind-to none -np 1 ./affinity

rank 0 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,

I have attached the affinity.c program FYI.
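
The attachment itself is not reproduced in the archive. A minimal sketch of what such an affinity-printing program might look like (a hypothetical reconstruction, not the actual attached affinity.c; build with mpicc):

/* Sketch: each rank prints its hostname and the logical cpus present
 * in its sched_getaffinity mask. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[256];
    gethostname(host, sizeof(host));

    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = calling process */

    printf("rank %d @ %s ", rank, host);
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf("%d, ", cpu);
    printf("\n");

    MPI_Finalize();
    return 0;
}

This produces the "rank N @ host cpu, cpu, ..." lines quoted throughout the thread.
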
Clearly, sched_getaffinity in my test code returns the correct map.

Now if I try to start all 32 processes in this example (still no binding):

rank 0 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 1 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 10 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 11 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 12 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 13 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 6 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 2 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 7 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 8 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 3 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 14 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 4 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 15 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 9 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 5 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 16 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 17 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 29 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 30 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 18 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 19 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 31 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 20 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 22 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 21 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 23 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 24 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 25 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 26 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 27 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 28 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,

Still looks ok to me. If I now turn the binding on, openmpi fails:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        c1-31
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

The above tests were done with 1.10.1rc1, so it does not fix the problem.

Marcin

I’m wondering because bind-to core will attempt to bind your proc to both HTs on the core. For some reason, we thought that 8,24 were HTs on the same core, which is why we tried to bind to that pair of HTs. We got an error because HT #24 was not allocated to us on node c6, but HT #8 was.

On Oct 3, 2015, at 2:43 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:

Hi, Ralph,

I submit my slurm job as follows

salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0

Effectively, the allocated CPU cores are spread among many cluster nodes. SLURM uses cgroups to limit the CPU cores available for mpi processes running on a given cluster node. Compute nodes are 2-socket systems with 8-core E5-2670 CPUs and HyperThreading on:

node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node distances:
node   0   1
  0:  10  21
  1:  21  10

I run the MPI program with the command

mpirun --report-bindings --bind-to core -np 64 ./affinity

The program simply runs sched_getaffinity for each process and prints out the result.

----------- TEST RUN 1 -----------

For this particular job the problem is more severe: openmpi fails to run at all with the error

--------------------------------------------------------------------------
Open MPI tried to bind a new process, but something went wrong. The
process was killed without launching the target application. Your job
will now abort.

  Local host:        c6-6
  Application name:  ./affinity
  Error message:     hwloc_set_cpubind returned "Error" for bitmap "8,24"
  Location:          odls_default_module.c:551
--------------------------------------------------------------------------

These are the SLURM environment variables:

SLURM_JOBID=12712225
SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
SLURM_JOB_ID=12712225
SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
SLURM_JOB_NUM_NODES=24
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=24
SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=64
SLURM_NTASKS=64
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-2.local
SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'

There are also a lot of warnings like

[compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all available processors)

----------- TEST RUN 2 -----------

In another allocation I got a different error:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        c6-19
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

and the allocation was the following:

SLURM_JOBID=12712250
SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
SLURM_JOB_ID=12712250
SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
SLURM_JOB_NUM_NODES=15
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=15
SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=64
SLURM_NTASKS=64
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-2.local
SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'

If in this case I run on only 32 cores

mpirun --report-bindings --bind-to core -np 32 ./affinity

the processes start, but I get the original binding problem:

[compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all available processors)

Running with --hetero-nodes yields exactly the same results.

Hope the above is useful. The problem with binding under SLURM with CPU cores spread over nodes seems to be very reproducible. OpenMPI actually dies very often with an error like the above. These tests were run with openmpi-1.8.8 and 1.10.0, both giving the same results.

One more suggestion. The warning message (MCW rank 8 is not bound...) is ONLY displayed when I use --report-bindings. It is never shown if I leave out this option, and although the binding is wrong the user is not notified. I think it would be better to show this warning in all cases where binding fails.

Let me know if you need more information. I can help to debug this - it is a rather crucial issue.

Thanks!
Marcin

On 10/02/2015 11:49 PM, Ralph Castain wrote:

Can you please send me the allocation request you made (so I can see what you specified on the cmd line), and the mpirun cmd line?

Thanks
Ralph

On Oct 2, 2015, at 8:25 AM, Marcin Krotkiewski <marcin.krotkiew...@gmail.com> wrote:

Hi,

I fail to make OpenMPI bind to cores correctly when running from within SLURM-allocated CPU resources spread over a range of compute nodes in an otherwise homogeneous cluster. I have found this thread

http://www.open-mpi.org/community/lists/users/2014/06/24682.php

and did try to use what Ralph suggested there (--hetero-nodes), but it does not work (v. 1.10.0). When running with --report-bindings I get messages like

[compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all available processors)

for all ranks outside of my first physical compute node. Moreover, everything works as expected if I ask SLURM to assign entire compute nodes. So it does look like Ralph's diagnosis presented in that thread is correct, just the --hetero-nodes switch does not work for me.

I have written a short code that uses sched_getaffinity to print the effective bindings: all MPI ranks except those on the first node are bound to all CPU cores allocated by SLURM.

Do I have to do something besides --hetero-nodes, or is this a problem that needs further investigation?

Thanks a lot!
Marcin
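
A quick way to check, independently of Open MPI, whether the cpuset that SLURM's cgroup hands to a process covers complete cores or only individual hardware threads is to compare the sched_getaffinity mask against each cpu's hardware-thread siblings. A minimal sketch, assuming the standard Linux sysfs layout (/sys/devices/system/cpu/cpuN/topology/thread_siblings_list) and not part of the original thread:

/* Sketch: for every cpu in the current affinity mask, check whether all of
 * its hardware-thread siblings are also in the mask. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &mask))
            continue;

        /* Sibling list of this cpu, e.g. "0,16" (or a range like "0-1"). */
        char path[128], buf[256];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f || !fgets(buf, sizeof(buf), f)) {
            if (f) fclose(f);
            continue;
        }
        fclose(f);

        int whole_core = 1;
        char *p = buf;
        while (*p && *p != '\n') {
            int lo, hi, n = 0;
            if (sscanf(p, "%d-%d%n", &lo, &hi, &n) == 2) {
                /* range entry */
            } else if (sscanf(p, "%d%n", &lo, &n) == 1) {
                hi = lo;                      /* single entry */
            } else {
                break;
            }
            for (int s = lo; s <= hi; s++)
                if (!CPU_ISSET(s, &mask))
                    whole_core = 0;           /* a sibling HT is missing */
            p += n;
            if (*p == ',')
                p++;
        }

        printf("cpu %2d: %s\n", cpu,
               whole_core ? "all HT siblings in mask" : "sibling HT missing");
    }
    return 0;
}

On the nodes described above, a whole-core allocation should report every cpu together with its sibling (e.g. 0 with 16), while an allocation of individual hardware threads would flag missing siblings.
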