Marcin, could you try v1.10.1rc1, which was released today? It fixes a bug where hwloc was trying to bind outside the cpuset.
Ralph and all,

IMHO, there are several issues here:
- if SLURM allocates threads instead of cores, then the --oversubscribe mpirun option may be mandatory
- with --oversubscribe --hetero-nodes, mpirun should not fail; if it still fails with v1.10.1rc1, I will ask for some more details in order to fix OMPI

Cheers,

Gilles

On Saturday, October 3, 2015, Ralph Castain <r...@open-mpi.org> wrote:
> Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating
> HTs as “cores” - i.e., as independent cpus. Any chance that is true?
>
> I’m wondering because bind-to core will attempt to bind your proc to both
> HTs on the core. For some reason, we thought that 8,24 were HTs on the same
> core, which is why we tried to bind to that pair of HTs. We got an error
> because HT #24 was not allocated to us on node c6, but HT #8 was.
>
> > On Oct 3, 2015, at 2:43 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
> >
> > Hi, Ralph,
> >
> > I submit my slurm job as follows
> >
> > salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0
> >
> > Effectively, the allocated CPU cores are spread among many cluster
> > nodes. SLURM uses cgroups to limit the CPU cores available for MPI
> > processes running on a given cluster node. Compute nodes are 2-socket,
> > 8-core E5-2670 systems with HyperThreading on:
> >
> > node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> > node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> > node distances:
> > node   0   1
> >   0:  10  21
> >   1:  21  10
> >
> > I run the MPI program with the command
> >
> > mpirun --report-bindings --bind-to core -np 64 ./affinity
> >
> > The program simply runs sched_getaffinity for each process and prints
> > out the result.
> >
> > -----------
> > TEST RUN 1
> > -----------
> > For this particular job the problem is more severe: Open MPI fails to run
> > at all with the error
> >
> > --------------------------------------------------------------------------
> > Open MPI tried to bind a new process, but something went wrong. The
> > process was killed without launching the target application. Your job
> > will now abort.
> >
> > Local host: c6-6
> > Application name: ./affinity
> > Error message: hwloc_set_cpubind returned "Error" for bitmap "8,24"
> > Location: odls_default_module.c:551
> > --------------------------------------------------------------------------
> >
> > These are the SLURM environment variables:
> >
> > SLURM_JOBID=12712225
> > SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
> > SLURM_JOB_ID=12712225
> > SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
> > SLURM_JOB_NUM_NODES=24
> > SLURM_JOB_PARTITION=normal
> > SLURM_MEM_PER_CPU=2048
> > SLURM_NNODES=24
> > SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
> > SLURM_NODE_ALIASES='(null)'
> > SLURM_NPROCS=64
> > SLURM_NTASKS=64
> > SLURM_SUBMIT_DIR=/cluster/home/marcink
> > SLURM_SUBMIT_HOST=login-0-2.local
> > SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
> >
> > There are also a lot of warnings like
> >
> > [compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all
> > available processors)
> >
> > -----------
> > TEST RUN 2
> > -----------
> >
> > In another allocation I got a different error
> >
> > --------------------------------------------------------------------------
> > A request was made to bind to that would result in binding more
> > processes than cpus on a resource:
> >
> > Bind to: CORE
> > Node: c6-19
> > #processes: 2
> > #cpus: 1
> >
> > You can override this protection by adding the "overload-allowed"
> > option to your binding directive.
> > --------------------------------------------------------------------------
> >
> > and the allocation was the following:
> >
> > SLURM_JOBID=12712250
> > SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
> > SLURM_JOB_ID=12712250
> > SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
> > SLURM_JOB_NUM_NODES=15
> > SLURM_JOB_PARTITION=normal
> > SLURM_MEM_PER_CPU=2048
> > SLURM_NNODES=15
> > SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
> > SLURM_NODE_ALIASES='(null)'
> > SLURM_NPROCS=64
> > SLURM_NTASKS=64
> > SLURM_SUBMIT_DIR=/cluster/home/marcink
> > SLURM_SUBMIT_HOST=login-0-2.local
> > SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
> >
> > If in this case I run on only 32 cores
> >
> > mpirun --report-bindings --bind-to core -np 32 ./affinity
> >
> > the process starts, but I get the original binding problem:
> >
> > [compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all
> > available processors)
> >
> > Running with --hetero-nodes yields exactly the same results.
> >
> > Hope the above is useful. The problem with binding under SLURM with CPU
> > cores spread over nodes seems to be very reproducible; it is actually very
> > common for Open MPI to die with an error like the above. These tests were run
> > with openmpi-1.8.8 and 1.10.0, both giving the same results.
> >
> > One more suggestion. The warning message (MCW rank 8 is not bound...) is
> > ONLY displayed when I use --report-bindings. It is never shown if I leave
> > out this option, and although the binding is wrong the user is not
> > notified. I think it would be better to show this warning whenever
> > binding fails.
> >
> > Let me know if you need more information. I can help debug this - it
> > is a rather crucial issue.
> >
> > Thanks!
> >
> > Marcin
> >
> > On 10/02/2015 11:49 PM, Ralph Castain wrote:
> >> Can you please send me the allocation request you made (so I can see
> >> what you specified on the cmd line), and the mpirun cmd line?
> >>
> >> Thanks
> >> Ralph
> >>
> >>> On Oct 2, 2015, at 8:25 AM, Marcin Krotkiewski <marcin.krotkiew...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I fail to make Open MPI bind to cores correctly when running from
> >>> within SLURM-allocated CPU resources spread over a range of compute nodes
> >>> in an otherwise homogeneous cluster. I have found this thread
> >>>
> >>> http://www.open-mpi.org/community/lists/users/2014/06/24682.php
> >>>
> >>> and did try what Ralph suggested there (--hetero-nodes), but it
> >>> does not work (v1.10.0). When running with --report-bindings I get
> >>> messages like
> >>>
> >>> [compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all
> >>> available processors)
> >>>
> >>> for all ranks outside of my first physical compute node. Moreover,
> >>> everything works as expected if I ask SLURM to assign entire compute nodes,
> >>> so it does look like Ralph's diagnosis presented in that thread is correct;
> >>> the --hetero-nodes switch just does not work for me.
> >>>
> >>> I have written a short code that uses sched_getaffinity to print the
> >>> effective bindings: all MPI ranks except those on the first node are
> >>> bound to all CPU cores allocated by SLURM.
> >>>
> >>> Do I have to do anything besides --hetero-nodes, or is this a
> >>> problem that needs further investigation?
> >>>
> >>> Thanks a lot!
> >>>
> >>> Marcin
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>> Link to this post: http://www.open-mpi.org/community/lists/users/2015/10/27770.php
> >>
> >> Link to this post: http://www.open-mpi.org/community/lists/users/2015/10/27774.php
> >
> > Link to this post: http://www.open-mpi.org/community/lists/users/2015/10/27776.php
>
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/10/27778.php