On 10/03/2015 01:06 PM, Ralph Castain wrote:
Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating HTs as
“cores” - i.e., as independent cpus. Any chance that is true?
Not to the best of my knowledge, and at least not intentionally. SLURM
starts as many processes as there are physical cores, not threads. To
verify this, consider this test case:
SLURM_JOB_CPUS_PER_NODE='6,8(x2),10'
SLURM_JOB_NODELIST='c1-[30-31],c2-[32,34]'
If I now execute only one MPI process WITH NO BINDING, it will go onto
c1-30 and should get an affinity mask covering 6 cores (12 hardware threads). I run
mpirun --bind-to none -np 1 ./affinity
rank 0 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
I have attached the affinity.c program FYI. Clearly, sched_getaffinity
in my test code returns the correct map.
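For reference, the compressed SLURM_JOB_CPUS_PER_NODE string above expands to
6 + 8 + 8 + 10 = 32 cores across the four nodes, which is where the process count
in the next test comes from. A minimal sketch (my own helper, not part of the
attached test program) that expands the 'N' / 'N(xM)' entries:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* fall back to the example allocation above if the variable is not set */
    const char *spec = getenv("SLURM_JOB_CPUS_PER_NODE");
    if (!spec) spec = "6,8(x2),10";

    char *copy = strdup(spec), *saveptr = NULL;
    int node = 0, total = 0;

    /* entries are comma-separated and are either "N" or "N(xM)" */
    for (char *tok = strtok_r(copy, ",", &saveptr); tok != NULL;
         tok = strtok_r(NULL, ",", &saveptr)) {
        int cpus = 0, repeat = 1;
        if (sscanf(tok, "%d(x%d)", &cpus, &repeat) < 1)
            continue;
        for (int r = 0; r < repeat; r++, node++) {
            printf("node %d: %d cpus\n", node, cpus);
            total += cpus;
        }
    }
    printf("total: %d cpus\n", total);
    free(copy);
    return 0;
}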
Now if I try to start all 32 processes in this example (still no binding):
rank 0 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 1 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 10 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 11 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 12 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 13 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 6 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 2 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 7 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 8 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 3 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 14 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 4 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 15 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 9 @ compute-1-31.local 2, 3, 7, 11, 12, 13, 14, 15, 18, 19, 23, 27, 28, 29, 30, 31,
rank 5 @ compute-1-30.local 0, 1, 3, 4, 5, 6, 16, 17, 19, 20, 21, 22,
rank 16 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 17 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 29 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 30 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 18 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 19 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 31 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 20 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 22 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 21 @ compute-2-32.local 7, 8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 28, 29, 30,
rank 23 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 24 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 25 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 26 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 27 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
rank 28 @ compute-2-34.local 0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 30, 31,
Still looks OK to me. If I now turn binding on, Open MPI fails:
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: c1-31
#processes: 2
#cpus: 1
You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
The above tests were done with 1.10.1rc1, so that release does not fix the problem.
Marcin
I’m wondering because bind-to core will attempt to bind your proc to both HTs
on the core. For some reason, we thought that 8,24 were HTs on the same core,
which is why we tried to bind to that pair of HTs. We got an error because HT
#24 was not allocated to us on node c6, but HT #8 was.
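An easy way to check that pairing on the node itself is to list which hardware
threads (PUs) share each core. Here is a minimal sketch (not code from the
thread; it assumes the hwloc headers are installed, built e.g. with
"cc check_ht.c -lhwloc"):

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* for every core, print the OS indices of its PUs (hardware threads) */
    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (int c = 0; c < ncores; c++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, c);
        char pus[128];
        hwloc_bitmap_list_snprintf(pus, sizeof(pus), core->cpuset);
        printf("core L#%d: PUs %s\n", c, pus);
    }

    hwloc_topology_destroy(topo);
    return 0;
}

If 8 and 24 show up on the same core there, the pairing assumed above is correct
and the question is only whether both PUs were in the job's allowed set.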
On Oct 3, 2015, at 2:43 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com>
wrote:
Hi, Ralph,
I submit my slurm job as follows
salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0
Effectively, the allocated CPU cores are spread among many cluster nodes.
SLURM uses cgroups to limit the CPU cores available for the MPI processes running
on a given cluster node. Compute nodes are 2-socket, 8-core E5-2670 systems
with HyperThreading on:
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node distances:
node 0 1
0: 10 21
1: 21 10
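Since the restriction comes from the cgroup cpuset, a quick way to see exactly
which CPUs the job step was granted on a node is to print Cpus_allowed_list from
/proc/self/status under srun. A minimal sketch (not code from the thread; the
file and field are standard Linux, the rest is illustrative):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Cpus_allowed_list reflects the cgroup cpuset imposed by SLURM */
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Cpus_allowed_list:", 18) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}

Comparing that list with what sched_getaffinity reports inside the MPI processes
shows whether mpirun narrowed the mask at all.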
I run the MPI program with the command
mpirun --report-bindings --bind-to core -np 64 ./affinity
The program simply calls sched_getaffinity in each process and prints out the
result.
-----------
TEST RUN 1
-----------
For this particular job the problem is more severe: Open MPI fails to run at all
with the error
--------------------------------------------------------------------------
Open MPI tried to bind a new process, but something went wrong. The
process was killed without launching the target application. Your job
will now abort.
Local host: c6-6
Application name: ./affinity
Error message: hwloc_set_cpubind returned "Error" for bitmap "8,24"
Location: odls_default_module.c:551
--------------------------------------------------------------------------
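For reference, the failing call can be exercised outside of Open MPI. A minimal
sketch (not Open MPI's actual code path; assumes hwloc is installed) that builds
a cpuset containing PUs 8 and 24 and asks hwloc to bind the current process to
it, printing the OS error if the bind is refused:

#include <hwloc.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* request binding to the same pair of hardware threads as in the error */
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    hwloc_bitmap_set(set, 8);
    hwloc_bitmap_set(set, 24);

    if (hwloc_set_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) < 0)
        fprintf(stderr, "hwloc_set_cpubind failed: %s\n", strerror(errno));
    else
        printf("bound to PUs 8,24\n");

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}

Whether this fails inside the job's cgroup depends on how the kernel treats a
mask that is only partly allowed, so it is a probe rather than an exact
reproduction of what Open MPI does.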
These are the SLURM environment variables:
SLURM_JOBID=12712225
SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
SLURM_JOB_ID=12712225
SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
SLURM_JOB_NUM_NODES=24
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=24
SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=64
SLURM_NTASKS=64
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-2.local
SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
There are also a lot of warnings like
[compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all available processors)
-----------
TEST RUN 2
-----------
In another allocation I got a different error
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: c6-19
#processes: 2
#cpus: 1
You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
and the allocation was the following
SLURM_JOBID=12712250
SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
SLURM_JOB_ID=12712250
SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
SLURM_JOB_NUM_NODES=15
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=15
SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=64
SLURM_NTASKS=64
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-2.local
SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
If in this case I run on only 32 cores
mpirun --report-bindings --bind-to core -np 32 ./affinity
the processes start, but I get the original binding problem:
[compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all available processors)
Running with --hetero-nodes yields exactly the same results.
Hope the above is useful. The binding problem under SLURM, with CPU cores
spread over nodes, is very reproducible: Open MPI very often dies with an error
like the ones above. These tests were run with openmpi-1.8.8 and 1.10.0, both
giving the same results.
One more suggestion. The warning message (MCW rank 8 is not bound...) is ONLY
displayed when I use --report-bindings. It is never shown if I leave out this
option, so although the binding is wrong the user is not notified. I think it
would be better to show this warning in all cases where binding fails.
Let me know if you need more information. I can help to debug this - it is a
rather crucial issue.
Thanks!
Marcin
On 10/02/2015 11:49 PM, Ralph Castain wrote:
Can you please send me the allocation request you made (so I can see what you
specified on the cmd line), and the mpirun cmd line?
Thanks
Ralph
On Oct 2, 2015, at 8:25 AM, Marcin Krotkiewski <marcin.krotkiew...@gmail.com>
wrote:
Hi,
I am unable to make Open MPI bind to cores correctly when running within
SLURM-allocated CPU resources spread over a range of compute nodes in an
otherwise homogeneous cluster. I have found this thread
http://www.open-mpi.org/community/lists/users/2014/06/24682.php
and tried what Ralph suggested there (--hetero-nodes), but it does not
work (v. 1.10.0). When running with --report-bindings I get messages like
[compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all available processors)
for all ranks outside of my first physical compute node. Moreover, everything
works as expected if I ask SLURM to assign entire compute nodes. So it does
look like Ralph's diagnosis presented in that thread is correct; it is just that
the --hetero-nodes switch does not work for me.
I have written a short program that uses sched_getaffinity to print the effective
bindings: all MPI ranks except those on the first node are bound to all CPU
cores allocated by SLURM.
Do I have to do something besides --hetero-nodes, or is this a problem that
needs further investigation?
Thanks a lot!
Marcin
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <unistd.h>
/* attached test program: prints each rank's affinity mask
   (build e.g. with "mpicc affinity.c -o affinity") */
int main(int argc, char *argv[])
{
#define NCPUS 128
    cpu_set_t *mask = NULL;
    char hname[256];
    size_t size;
    int mpi_rank, mpi_size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    /* allocate a dynamically sized CPU set large enough for NCPUS hardware
       threads and fill it with the affinity mask of the calling process */
    mask = CPU_ALLOC(NCPUS);
    size = CPU_ALLOC_SIZE(NCPUS);
    CPU_ZERO_S(size, mask);
    if ( sched_getaffinity(0, size, mask) == -1 ) {
        perror("sched_getaffinity");
        CPU_FREE(mask);
        MPI_Finalize();
        return -1;
    }

    /* print the mask of one rank at a time to keep the output readable */
    for (int r = 0; r < mpi_size; r++) {
        if (mpi_rank == r) {
            gethostname(hname, 255);
            printf("rank %d @ %s ", mpi_rank, hname);
            for (int i = 0; i < NCPUS; i++) {
                if ( CPU_ISSET_S(i, size, mask) ) {
                    printf(" %d,", i);
                }
            }
            printf("\n"); fflush(stdout);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }

    CPU_FREE(mask);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}