Hi,
I fail to make Open MPI bind to cores correctly when running within a
SLURM allocation whose CPU resources are spread over several compute
nodes of an otherwise homogeneous cluster. I have found this thread
http://www.open-mpi.org/community/lists/users/2014/06/24682.php
and tried what Ralph suggested there (--hetero-nodes), but it does not
work for me (v1.10.0). When running with --report-bindings I get
messages like
[compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all
available processors)
for all ranks outside of my first physical compute node. Moreover,
everything works as expected if I ask SLURM to assign entire compute
nodes. So it does look like Ralph's diagnosis presented in that thread
is correct; it is just that the --hetero-nodes switch does not work for me.
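For reference, this is essentially how I launch the job inside the allocation (the application name here is just a placeholder):

```shell
# From within an sbatch/salloc allocation that spans several
# partially-allocated nodes:
mpirun --hetero-nodes --bind-to core --report-bindings ./my_app
```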
I have written a short program that uses sched_getaffinity to print the
effective bindings: all MPI ranks except those on the first node are
bound to all CPU cores allocated by SLURM.
Do I have to do something besides passing --hetero-nodes, or is this a
problem that needs further investigation?
Thanks a lot!
Marcin