Hi Marcin,

Looking again at this: could you get a similar reservation again and rerun mpirun with “-display-allocation” added to the command line? I’d like to see if we are correctly parsing the number of slots assigned in the allocation.
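[Editor's note: the rerun being requested, sketched as a shell session. The allocation parameters are copied from Marcin's failing case quoted below; the commands are shown dry-run via echo because they need a live SLURM cluster.]

```shell
# Dry-run sketch of the requested rerun: the same reservation as in the
# failing case, with -display-allocation added so mpirun prints how many
# slots it parsed from the allocation. echo is used because the real
# commands require a SLURM cluster.
echo "salloc --ntasks=4 --cpus-per-task=4 --tasks-per-node=4"
echo "mpirun -display-allocation --map-by slot:pe=4 ./affinity"
```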
Ralph

> On Oct 6, 2015, at 11:52 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>
> Thank you both for your suggestion. I still cannot make this work, though, and I think - as Ralph predicted - most problems are likely related to non-homogeneous mapping of cpus to jobs. But there are problems even before that part.
>
> If I reserve one entire compute node with SLURM:
>
> salloc --ntasks=16 --tasks-per-node=16
>
> I can run my code as you suggested with _any_ N (including odd numbers!). OpenMPI will figure out the maximum number of tasks that fits and launch them. This also works for many complete nodes, but it is the only case in which I managed to get things to work.
>
> If I specify cpus per task, also allocating one full node:
>
> salloc --ntasks=4 --cpus-per-task=4 --tasks-per-node=4
>
> things go astray:
>
> mpirun --map-by slot:pe=4 ./affinity
> rank 0 @ compute-1-6.local 0, 1, 2, 3, 16, 17, 18, 19,
>
> Yes, only one MPI process was started. Running what Gilles previously suggested:
>
> $ srun grep Cpus_allowed_list /proc/self/status
> Cpus_allowed_list: 0-31
> Cpus_allowed_list: 0-31
> Cpus_allowed_list: 0-31
> Cpus_allowed_list: 0-31
>
> So the allocation seems fine. The SLURM environment is also correct, as far as I can tell:
>
> SLURM_CPUS_PER_TASK=4
> SLURM_JOB_CPUS_PER_NODE=16
> SLURM_JOB_NODELIST=c1-6
> SLURM_JOB_NUM_NODES=1
> SLURM_NNODES=1
> SLURM_NODELIST=c1-6
> SLURM_NPROCS=4
> SLURM_NTASKS=4
> SLURM_NTASKS_PER_NODE=4
> SLURM_TASKS_PER_NODE=4
>
> I do not understand why OpenMPI does not want to start more than one process. If I try to force it (-n 4), I of course get an error:
>
> mpirun --map-by slot:pe=4 -n 4 ./affinity
>
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 4 slots
> that were requested by the application:
>   ./affinity
>
> Either request fewer slots for your application, or make more slots
> available for use.
> --------------------------------------------------------------------------
>
> For clarity, I will not describe the other cases / non-contiguous cpu sets / heterogeneous nodes. Clearly something is wrong already with the simple ones.
>
> Does anyone have any ideas? Should I record some logs to see what's going on?
>
> Thanks a lot!
>
> Marcin
>
> On 10/06/2015 01:04 AM, tmish...@jcity.maeda.co.jp wrote:
>> Hi Ralph, it's been a long time.
>>
>> The option "map-by core" does not work when pe=N > 1 is specified.
>> So, you should use "map-by slot:pe=N", as far as I remember.
>>
>> Regards,
>> Tetsuya Mishima
>>
>> On 2015/10/06 5:40:33, "users" wrote in "Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM":
>>> Hmmm… okay, try -map-by socket:pe=4
>>>
>>> We’ll still hit the asymmetric topology issue, but otherwise this should work.
>>>
>>>> On Oct 5, 2015, at 1:25 PM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>
>>>> Ralph,
>>>>
>>>> Thank you for the fast response! Sounds very good; unfortunately, I get an error:
>>>>
>>>> $ mpirun --map-by core:pe=4 ./affinity
>>>> --------------------------------------------------------------------------
>>>> A request for multiple cpus-per-proc was given, but a directive
>>>> was also given to map to an object level that cannot support that
>>>> directive.
>>>>
>>>> Please specify a mapping level that has more than one cpu, or
>>>> else let us define a default mapping that will allow multiple
>>>> cpus-per-proc.
>>>> --------------------------------------------------------------------------
>>>>
>>>> I have allocated my slurm job as
>>>>
>>>> salloc --ntasks=2 --cpus-per-task=4
>>>>
>>>> I have checked in 1.10.0 and 1.10.1rc1.
>>>>
>>>> On 10/05/2015 09:58 PM, Ralph Castain wrote:
>>>>> You would presently do:
>>>>>
>>>>> mpirun --map-by core:pe=4
>>>>>
>>>>> to get what you are seeking.
>>>>> If we don’t already set that qualifier when we see “cpus_per_task”, then we probably should do so, as there isn’t any reason to make you set it twice (well, other than trying to track which envar slurm is using now).
>>>>>
>>>>>> On Oct 5, 2015, at 12:38 PM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>>>
>>>>>> Yet another question about cpu binding under the SLURM environment.
>>>>>>
>>>>>> Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the purpose of cpu binding?
>>>>>>
>>>>>> Full version: when you allocate a job like, e.g., this
>>>>>>
>>>>>> salloc --ntasks=2 --cpus-per-task=4
>>>>>>
>>>>>> SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI task. This is useful for hybrid jobs, where each MPI process spawns some internal worker threads (e.g., OpenMP). The intention is that 2 MPI procs are started, each of them 'bound' to 4 cores. SLURM will also set an environment variable
>>>>>>
>>>>>> SLURM_CPUS_PER_TASK=4
>>>>>>
>>>>>> which should (probably?) be taken into account by the method that launches the MPI processes to figure out the cpuset. In the case of OpenMPI + mpirun, I think something should happen in orte/mca/ras/slurm/ras_slurm_module.c, where the variable _is_ actually parsed. Unfortunately, it is never really used...
>>>>>>
>>>>>> As a result, the cpuset of all tasks started on a given compute node includes all CPU cores of all MPI tasks on that node, just as provided by SLURM (in the above example, 8). In general, there is no simple way for the user code in the MPI procs to 'split' the cores between themselves. I imagine the original intention to support this in OpenMPI was something like
>>>>>>
>>>>>> mpirun --bind-to subtask_cpuset
>>>>>>
>>>>>> with an artificial bind target that would cause OpenMPI to divide the allocated cores between the MPI tasks. Is this right? If so, it seems that at this point this is not implemented. Are there plans to do this?
>>>>>> If not, does anyone know another way to achieve that?
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>> Marcin
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2015/10/27803.php
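[Editor's note: Gilles's cpuset check from the thread also works on any Linux box without srun, which makes it easy to sanity-check the reader before getting a SLURM allocation involved. A minimal sketch:]

```shell
# Print the cpuset of the current process. Under srun, each task runs this
# and the output shows the cpus that task was actually given; this is the
# same check Marcin quotes in the thread.
grep Cpus_allowed_list /proc/self/status
```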
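[Editor's note: until mpirun picks up SLURM_CPUS_PER_TASK by itself, the "set it twice" duplication Ralph mentions can be worked around in a wrapper script that derives the pe=N qualifier from the environment. A hedged sketch; the variable handling is mine, only the flag names come from the thread:]

```shell
# Build the mpirun mapping option from SLURM_CPUS_PER_TASK so the cpu
# count is specified only once, in the salloc line. Falls back to pe=1
# when the variable is unset (i.e., outside a SLURM allocation).
PE="${SLURM_CPUS_PER_TASK:-1}"
MAP="slot:pe=${PE}"
echo "mpirun --map-by ${MAP} ./affinity"
```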