Thanks Marcin. Looking at this, I’m guessing that Slurm may be treating HTs as 
“cores” - i.e., as independent cpus. Any chance that is true?

I’m wondering because --bind-to core will attempt to bind your process to both HTs 
on the core. For some reason, we thought that 8,24 were HTs on the same core, 
which is why we tried to bind to that pair of HTs. We got an error because HT 
#24 was not allocated to us on node c6-6, but HT #8 was.
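
A quick way to check (just a suggestion; assuming scontrol is available on your 
login node, your site tooling may differ) would be something like

scontrol show node c6-6 | grep -E 'CPUTot|ThreadsPerCore'

If that reports CPUTot=32 with ThreadsPerCore=2, then Slurm is exposing each 
hyperthread as an independent CPU, which would be consistent with the cpuset 
we were handed.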


> On Oct 3, 2015, at 2:43 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> 
> wrote:
> 
> Hi, Ralph,
> 
> I submit my SLURM job as follows:
> 
> salloc --ntasks=64 --mem-per-cpu=2G --time=1:0:0
> 
> Effectively, the allocated CPU cores are spread among many cluster nodes. 
> SLURM uses cgroups to limit the CPU cores available to the MPI processes running 
> on a given cluster node. The compute nodes are 2-socket, 8-core E5-2670 systems 
> with HyperThreading enabled:
> 
> node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> node distances:
> node   0   1
>  0:  10  21
>  1:  21  10
> 
> I run the MPI program with the command
> 
> mpirun  --report-bindings --bind-to core -np 64 ./affinity
> 
> The program simply runs sched_getaffinity for each process and prints out the 
> result.
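> 
> For reference, the test is essentially the minimal sketch below (the real 
> program may differ in details; this just illustrates the idea):
> 
> #define _GNU_SOURCE
> #include <sched.h>
> #include <stdio.h>
> #include <string.h>
> #include <mpi.h>
> 
> int main(int argc, char **argv)
> {
>     int rank, cpu;
>     char buf[8192] = "";
>     cpu_set_t mask;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>     /* query the affinity mask of the calling process (pid 0 = self) */
>     CPU_ZERO(&mask);
>     if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
>         perror("sched_getaffinity");
> 
>     /* collect the CPU ids this rank is allowed to run on */
>     for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
>         if (CPU_ISSET(cpu, &mask))
>             sprintf(buf + strlen(buf), " %d", cpu);
> 
>     printf("rank %d: allowed cpus:%s\n", rank, buf);
> 
>     MPI_Finalize();
>     return 0;
> }
> 
> (compiled with something like: mpicc affinity.c -o affinity)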
> 
> -----------
> TEST RUN 1
> -----------
> For this particular job the problem is more severe: Open MPI fails to run at 
> all, with the error
> 
> --------------------------------------------------------------------------
> Open MPI tried to bind a new process, but something went wrong.  The
> process was killed without launching the target application.  Your job
> will now abort.
> 
>  Local host:        c6-6
>  Application name:  ./affinity
>  Error message:     hwloc_set_cpubind returned "Error" for bitmap "8,24"
>  Location:          odls_default_module.c:551
> --------------------------------------------------------------------------
> 
> These are the SLURM environment variables:
> 
> SLURM_JOBID=12712225
> SLURM_JOB_CPUS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
> SLURM_JOB_ID=12712225
> SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
> SLURM_JOB_NUM_NODES=24
> SLURM_JOB_PARTITION=normal
> SLURM_MEM_PER_CPU=2048
> SLURM_NNODES=24
> SLURM_NODELIST='c6-[3,6-8,12,14,17,22-23],c8-[4,7,9,17,20,28],c15-[5,10,18,20,22-24,28],c16-11'
> SLURM_NODE_ALIASES='(null)'
> SLURM_NPROCS=64
> SLURM_NTASKS=64
> SLURM_SUBMIT_DIR=/cluster/home/marcink
> SLURM_SUBMIT_HOST=login-0-2.local
> SLURM_TASKS_PER_NODE='3(x2),2,1(x3),2(x2),1,3(x3),5,1,4,1,3,2,3,7,1,5,6,1'
> 
> There are also a lot of warnings like
> 
> [compute-6-6.local:20158] MCW rank 4 is not bound (or bound to all available 
> processors)
> 
> 
> -----------
> TEST RUN 2
> -----------
> 
> In another allocation I got a different error
> 
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
> 
>   Bind to:     CORE
>   Node:        c6-19
>   #processes:  2
>   #cpus:       1
> 
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
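> 
> (For completeness: the override the message refers to is, I believe, a qualifier 
> on the binding policy, e.g.
> 
> mpirun --report-bindings --bind-to core:overload-allowed -np 64 ./affinity
> 
> but overloading cores is not what I want here.)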
> 
> The allocation was the following:
> 
> SLURM_JOBID=12712250
> SLURM_JOB_CPUS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
> SLURM_JOB_ID=12712250
> SLURM_JOB_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
> SLURM_JOB_NUM_NODES=15
> SLURM_JOB_PARTITION=normal
> SLURM_MEM_PER_CPU=2048
> SLURM_NNODES=15
> SLURM_NODELIST='c6-[3,6-8,12,14,17,19,22-23],c8-[4,7,9,17,28]'
> SLURM_NODE_ALIASES='(null)'
> SLURM_NPROCS=64
> SLURM_NTASKS=64
> SLURM_SUBMIT_DIR=/cluster/home/marcink
> SLURM_SUBMIT_HOST=login-0-2.local
> SLURM_TASKS_PER_NODE='3(x2),2,1,15,1,3,16,2,1,3(x2),2,5,4'
> 
> 
> If in this case I run on only 32 cores
> 
> mpirun  --report-bindings --bind-to core -np 32 ./affinity
> 
> the process starts, but I get the original binding problem:
> 
> [compute-6-8.local:31414] MCW rank 8 is not bound (or bound to all available 
> processors)
> 
> Running with --hetero-nodes yields exactly the same results.
> 
> Hope the above is useful. The problem with binding under SLURM when the allocated 
> CPU cores are spread over nodes seems to be very reproducible; Open MPI very often 
> dies with an error like the ones above. These tests were run with openmpi-1.8.8 
> and 1.10.0, both giving the same results.
> 
> One more suggestion. The warning message (MCW rank 8 is not bound...) is ONLY 
> displayed when I use --report-bindings. It is never shown if I leave out this 
> option, so although the binding is wrong, the user is not notified. I think it 
> would be better to show this warning whenever binding fails.
> 
> Let me know if you need more information. I can help to debug this - it is a 
> rather crucial issue.
> 
> Thanks!
> 
> Marcin
> 
> On 10/02/2015 11:49 PM, Ralph Castain wrote:
>> Can you please send me the allocation request you made (so I can see what 
>> you specified on the cmd line), and the mpirun cmd line?
>> 
>> Thanks
>> Ralph
>> 
>>> On Oct 2, 2015, at 8:25 AM, Marcin Krotkiewski 
>>> <marcin.krotkiew...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> I fail to make OpenMPI bind to cores correctly when running from within 
>>> SLURM-allocated CPU resources spread over a range of compute nodes in an 
>>> otherwise homogeneous cluster. I have found this thread
>>> 
>>> http://www.open-mpi.org/community/lists/users/2014/06/24682.php
>>> 
>>> and did try to use what Ralph suggested there (--hetero-nodes), but it does 
>>> not work (v. 1.10.0). When running with --report-bindings I get messages 
>>> like
>>> 
>>> [compute-9-11.local:27571] MCW rank 10 is not bound (or bound to all 
>>> available processors)
>>> 
>>> for all ranks outside of my first physical compute node. Moreover, 
>>> everything works as expected if I ask SLURM to assign entire compute nodes. 
>>> So it does look like Ralph's diagnosis presented in that thread is correct; 
>>> it is just that the --hetero-nodes switch does not work for me.
>>> 
>>> I have written a short code that uses sched_getaffinity to print the 
>>> effective bindings: all MPI ranks except those on the first node are 
>>> bound to all CPU cores allocated by SLURM.
>>> 
>>> Do I have to do something besides --hetero-nodes, or is this a problem 
>>> that needs further investigation?
>>> 
>>> Thanks a lot!
>>> 
>>> Marcin
>>> 
