Hi, Ralph,
I submit my slurm job as follows
salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0
Effectively, the allocated CPU cores are spread among many cluster
nodes. SLURM uses cgroups to limit the CPU cores available to the MPI
processes running on a given cluster node. Compute nodes are 2-socket,
8-core E5-2670 systems with HyperThreading on:
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node distances:
node 0 1
0: 10 21
1: 21 10
I run the MPI program with the command
mpirun --report-bindings --bind-to core -np 64 ./affinity
The program simply calls sched_getaffinity in each process and prints
out the result.
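For reference, such a check fits in a few lines. A minimal non-MPI Python sketch of the same idea (assuming Linux, where os.sched_getaffinity wraps the sched_getaffinity(2) syscall; the actual test program presumably also labels each line with its MPI rank):

```python
import os
import socket

def report_affinity():
    """Return and print the set of CPUs this process may run on.

    os.sched_getaffinity wraps the sched_getaffinity(2) syscall;
    under an MPI launcher, each rank would print its own mask.
    """
    cpus = sorted(os.sched_getaffinity(0))  # 0 = the calling process
    print(f"{socket.gethostname()}: bound to CPUs {cpus}")
    return cpus

if __name__ == "__main__":
    report_affinity()
```

With correct core binding, each rank should report a single CPU (or one hyperthread pair), not the full cgroup mask.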
-----------
TEST RUN 1
-----------
For this particular job the problem is more severe: Open MPI fails to run
at all with the error
--------------------------------------------------------------------------
Open MPI tried to bind a new process, but something went wrong. The
process was killed without launching the target application. Your job
will now abort.
Local host: c6-6
Application name: ./affinity
Error message: hwloc_set_cpubind returned "Error" for bitmap "8,24"
Location: odls_default_module.c:551
--------------------------------------------------------------------------
These are the SLURM environment variables:
SLURM_JOBID=12712225
SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
SLURM_JOB_ID=12712225
SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
SLURM_JOB_NUM_NODES=24
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=24
SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=64
SLURM_NTASKS=64
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-2.local
SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
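As an aside on reading SLURM_TASKS_PER_NODE: SLURM compresses repeated per-node counts as "N(xR)" (N tasks on each of R consecutive nodes). A small sketch (my own helper, not a SLURM API) that expands the notation makes the fragmentation visible; for the spec above it yields 24 nodes carrying between 1 and 7 tasks each, summing to the 64 requested:

```python
import re

def expand_counts(spec):
    """Expand SLURM's compact per-node notation, e.g. '3(x2),2' -> [3, 3, 2]."""
    counts = []
    for field in spec.split(","):
        m = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", field)
        if m is None:
            raise ValueError(f"unrecognized field: {field!r}")
        count, repeat = int(m.group(1)), int(m.group(2) or 1)
        counts.extend([count] * repeat)
    return counts

spec = "3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1"
tasks = expand_counts(spec)
print(len(tasks), sum(tasks))  # -> 24 64
```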
There are also a lot of warnings like
[compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all
available processors)
-----------
TEST RUN 2
-----------
In another allocation I got a different error:
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: c6-19
#processes: 2
#cpus: 1
You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
and the allocation was the following:
SLURM_JOBID=12712250
SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
SLURM_JOB_ID=12712250
SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
SLURM_JOB_NUM_NODES=15
SLURM_JOB_PARTITION=normal
SLURM_MEM_PER_CPU=2048
SLURM_NNODES=15
SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
SLURM_NODE_ALIASES='(null)'
SLURM_NPROCS=64
SLURM_NTASKS=64
SLURM_SUBMIT_DIR=/cluster/home/marcink
SLURM_SUBMIT_HOST=login-0-2.local
SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
If in this case I run on only 32 cores,
mpirun --report-bindings --bind-to core -np 32 ./affinity
the job starts, but I get the original binding problem:
[compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all
available processors)
Running with --hetero-nodes yields exactly the same results.
Hope the above is useful. The binding problem under SLURM, with CPU
cores spread over nodes, seems to be very reproducible; in fact, Open MPI
very often dies with an error like the above. These tests were run with
Open MPI 1.8.8 and 1.10.0, both giving the same results.
One more suggestion: the warning message (MCW rank 8 is not bound...) is
ONLY displayed when I use --report-bindings. It is never shown if I
leave out this option, so although the binding is wrong, the user is not
notified. I think it would be better to show this warning whenever
binding fails.
Let me know if you need more information. I can help debug this - it
is a rather crucial issue.
Thanks!
Marcin
On 10/02/2015 11:49 PM, Ralph Castain wrote:
Can you please send me the allocation request you made (so I can see what you
specified on the cmd line), and the mpirun cmd line?
Thanks
Ralph
On Oct 2, 2015, at 8:25 AM, Marcin Krotkiewski <marcin.krotkiew...@gmail.com>
wrote:
Hi,
I fail to make Open MPI bind to cores correctly when running on
SLURM-allocated CPU resources spread over a range of compute nodes in an
otherwise homogeneous cluster. I have found this thread
http://www.open-mpi.org/community/lists/users/2014/06/24682.php
and did try what Ralph suggested there (--hetero-nodes), but it does not
work (v. 1.10.0). When running with --report-bindings I get messages like
[compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all available
processors)
for all ranks outside of my first physical compute node. Moreover, everything
works as expected if I ask SLURM to assign entire compute nodes. So it does
look like Ralph's diagnosis presented in that thread is correct; just the
--hetero-nodes switch does not work for me.
I have written a short program that uses sched_getaffinity to print the
effective bindings: all MPI ranks except those on the first node are bound to
all CPU cores allocated by SLURM.
Do I have to do something besides --hetero-nodes, or is this a problem that
needs further investigation?
Thanks a lot!
Marcin
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/10/27770.php
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/10/27774.php