Hi,

I am failing to get Open MPI to bind to cores correctly when running inside a SLURM allocation whose CPUs are spread over a range of compute nodes in an otherwise homogeneous cluster. I have found this thread

http://www.open-mpi.org/community/lists/users/2014/06/24682.php

and tried what Ralph suggested there (--hetero-nodes), but it does not work for me (v1.10.0). When I run with --report-bindings I get messages like

[compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all available processors)

for all ranks outside of my first physical compute node. Moreover, everything works as expected if I ask SLURM to assign entire compute nodes. So it does look like Ralph's diagnosis in that thread is correct; the --hetero-nodes switch just does not work for me.
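
For completeness, this is roughly how I allocate and launch (a sketch only; the task count is just an example, and in my case SLURM hands me partially-filled nodes rather than whole ones):

salloc -n 32
mpirun --hetero-nodes --report-bindings ./affinity_test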

I have written a short program that uses sched_getaffinity to print the effective bindings: all MPI ranks except those on the first node are bound to all CPU cores allocated by SLURM.
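
The test is essentially the following (a minimal sketch of what I ran; the output formatting is only illustrative):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    char host[256];
    cpu_set_t mask;
    char cores[8192] = "";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    /* Query the affinity mask actually in effect for this rank. */
    sched_getaffinity(0, sizeof(mask), &mask);

    /* Collect the core IDs this rank is allowed to run on. */
    for (int c = 0; c < CPU_SETSIZE; c++) {
        if (CPU_ISSET(c, &mask)) {
            char buf[16];
            snprintf(buf, sizeof(buf), " %d", c);
            strcat(cores, buf);
        }
    }

    printf("rank %d on %s bound to cores:%s\n", rank, host, cores);

    MPI_Finalize();
    return 0;
}

On the first node this prints a single core per rank, as expected; on the other nodes each rank reports every core that SLURM allocated there.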

Do I have to do anything beyond --hetero-nodes, or is this a problem that needs further investigation?

Thanks a lot!

Marcin
