Re: [OMPI users] Hybrid OpenMPI+OpenMP tasks using SLURM

Marcin Krotkiewski Fri, 9 Oct 2015 09:30:24 -0400 (EDT)


Thank you, Ralph. The world can wait, no problem :)


Marcin


On 10/09/2015 03:27 PM, Ralph Castain wrote:

Actually, you just confirmed the problem for me. You are correct in that it 
says 4 slots. However, if you then tell us pe=4, we will consume all 4 of those 
slots with the very first process.

What we needed to see was slurm assigning us 16 slots to correspond to the 
16 cpus. Instead, it is telling us to launch only 4 procs, but to use 16 
cpus as if they belong to us. This is where the confusion is coming from: 
something in the slurm envar syntax may have changed, or something else did, 
as I seem to recall we handled this okay before (but I could be wrong).
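
To make the arithmetic concrete, here is a minimal Python sketch. It is an
illustration only, not Open MPI code: the function name procs_that_fit is made
up for this example, and it assumes the simplified model described above, in
which each process consumes pe slots from the node's slot count.

```python
def procs_that_fit(slots: int, pe: int) -> int:
    """Processes that fit on a node if each one consumes `pe` slots.

    Simplified model of the behaviour described in the thread; the real
    mapper has more inputs (oversubscription policy, topology, and so on).
    """
    if slots < 0 or pe < 1:
        raise ValueError("slots must be >= 0 and pe must be >= 1")
    return slots // pe


# The allocation display reported slots=4, so pe=4 leaves room for one process.
reported = procs_that_fit(slots=4, pe=4)
# Had the 16 cpus been reported as 16 slots, four processes would fit.
expected = procs_that_fit(slots=16, pe=4)
print(reported, expected)  # prints "1 4"
```

Under this model, the single rank seen in the output is what slots=4 combined
with pe=4 would produce.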

Fixing that will take some time that I honestly won’t have for a while.

On Oct 9, 2015, at 6:14 AM, Marcin Krotkiewski <marcin.krotkiew...@gmail.com> 
wrote:

Ralph,

Here is the result running

mpirun --map-by slot:pe=4 -display-allocation ./affinity

======================   ALLOCATED NODES   ======================
    c12-29: slots=4 max_slots=0 slots_inuse=0 state=UP
=================================================================
rank 0 @ compute-12-29.local  1, 2, 3, 4, 17, 18, 19, 20,

I also attach the output with --mca rmaps_base_verbose 10. It says 4 slots all 
over the place, so it is really weird that it does not work.

Thanks!

Marcin



[login-0-1.local:30710] mca: base: components_register: registering rmaps 
components
[login-0-1.local:30710] mca: base: components_register: found loaded component 
round_robin
[login-0-1.local:30710] mca: base: components_register: component round_robin 
register function successful
[login-0-1.local:30710] mca: base: components_register: found loaded component 
rank_file
[login-0-1.local:30710] mca: base: components_register: component rank_file 
register function successful
[login-0-1.local:30710] mca: base: components_register: found loaded component 
seq
[login-0-1.local:30710] mca: base: components_register: component seq register 
function successful
[login-0-1.local:30710] mca: base: components_register: found loaded component 
resilient
[login-0-1.local:30710] mca: base: components_register: component resilient 
register function successful
[login-0-1.local:30710] mca: base: components_register: found loaded component 
staged
[login-0-1.local:30710] mca: base: components_register: component staged has no 
register or open function
[login-0-1.local:30710] mca: base: components_register: found loaded component 
mindist
[login-0-1.local:30710] mca: base: components_register: component mindist 
register function successful
[login-0-1.local:30710] mca: base: components_register: found loaded component 
ppr
[login-0-1.local:30710] mca: base: components_register: component ppr register 
function successful
[login-0-1.local:30710] [[61064,0],0] rmaps:base set policy with slot:pe=4
[login-0-1.local:30710] [[61064,0],0] rmaps:base policy slot modifiers pe=4 
provided
[login-0-1.local:30710] [[61064,0],0] rmaps:base check modifiers with pe=4
[login-0-1.local:30710] [[61064,0],0] rmaps:base setting pe/rank to 4
[login-0-1.local:30710] mca: base: components_open: opening rmaps components
[login-0-1.local:30710] mca: base: components_open: found loaded component 
round_robin
[login-0-1.local:30710] mca: base: components_open: component round_robin open 
function successful
[login-0-1.local:30710] mca: base: components_open: found loaded component 
rank_file
[login-0-1.local:30710] mca: base: components_open: component rank_file open 
function successful
[login-0-1.local:30710] mca: base: components_open: found loaded component seq
[login-0-1.local:30710] mca: base: components_open: component seq open function 
successful
[login-0-1.local:30710] mca: base: components_open: found loaded component 
resilient
[login-0-1.local:30710] mca: base: components_open: component resilient open 
function successful
[login-0-1.local:30710] mca: base: components_open: found loaded component 
staged
[login-0-1.local:30710] mca: base: components_open: component staged open 
function successful
[login-0-1.local:30710] mca: base: components_open: found loaded component 
mindist
[login-0-1.local:30710] mca: base: components_open: component mindist open 
function successful
[login-0-1.local:30710] mca: base: components_open: found loaded component ppr
[login-0-1.local:30710] mca: base: components_open: component ppr open function 
successful
[login-0-1.local:30710] mca:rmaps:select: checking available component 
round_robin
[login-0-1.local:30710] mca:rmaps:select: Querying component [round_robin]
[login-0-1.local:30710] mca:rmaps:select: checking available component rank_file
[login-0-1.local:30710] mca:rmaps:select: Querying component [rank_file]
[login-0-1.local:30710] mca:rmaps:select: checking available component seq
[login-0-1.local:30710] mca:rmaps:select: Querying component [seq]
[login-0-1.local:30710] mca:rmaps:select: checking available component resilient
[login-0-1.local:30710] mca:rmaps:select: Querying component [resilient]
[login-0-1.local:30710] mca:rmaps:select: checking available component staged
[login-0-1.local:30710] mca:rmaps:select: Querying component [staged]
[login-0-1.local:30710] mca:rmaps:select: checking available component mindist
[login-0-1.local:30710] mca:rmaps:select: Querying component [mindist]
[login-0-1.local:30710] mca:rmaps:select: checking available component ppr
[login-0-1.local:30710] mca:rmaps:select: Querying component [ppr]
[login-0-1.local:30710] [[61064,0],0]: Final mapper priorities
[login-0-1.local:30710]     Mapper: ppr Priority: 90
[login-0-1.local:30710]     Mapper: seq Priority: 60
[login-0-1.local:30710]     Mapper: resilient Priority: 40
[login-0-1.local:30710]     Mapper: mindist Priority: 20
[login-0-1.local:30710]     Mapper: round_robin Priority: 10
[login-0-1.local:30710]     Mapper: staged Priority: 5
[login-0-1.local:30710]     Mapper: rank_file Priority: 0

======================   ALLOCATED NODES   ======================
    c12-29: slots=4 max_slots=0 slots_inuse=0 state=UP
=================================================================
[login-0-1.local:30710] mca:rmaps: mapping job [61064,1]
[login-0-1.local:30710] mca:rmaps: creating new map for job [61064,1]
[login-0-1.local:30710] AVAILABLE NODES FOR MAPPING:
[login-0-1.local:30710]     node: c12-29 daemon: 1
[login-0-1.local:30710] mca:rmaps: nprocs 4
[login-0-1.local:30710] mca:rmaps mapping given - using default
[login-0-1.local:30710] mca:rmaps:ppr: job [61064,1] not using ppr mapper
[login-0-1.local:30710] mca:rmaps:seq: job [61064,1] not using seq mapper
[login-0-1.local:30710] mca:rmaps:resilient: cannot perform initial map of job 
[61064,1] - no fault groups
[login-0-1.local:30710] mca:rmaps:mindist: job [61064,1] not using mindist 
mapper
[login-0-1.local:30710] mca:rmaps:rr: mapping job [61064,1]
[login-0-1.local:30710] AVAILABLE NODES FOR MAPPING:
[login-0-1.local:30710]     node: c12-29 daemon: 1
[login-0-1.local:30710] mca:rmaps:rr: mapping by slot for job [61064,1] slots 4 
num_procs 1
[login-0-1.local:30710] mca:rmaps:rr:slot working node c12-29
[login-0-1.local:30710] mca:rmaps:rr:slot assigning 1 procs to node c12-29
[login-0-1.local:30710] mca:rmaps:base: computing vpids by slot for job 
[61064,1]
[login-0-1.local:30710] mca:rmaps:base: assigning rank 0 to node c12-29
[login-0-1.local:30710] mca:rmaps: compute bindings for job [61064,1] with 
policy CORE:IF-SUPPORTED[5008]
[login-0-1.local:30710] [[61064,0],0] reset_usage: node c12-29 has 1 procs on it
[login-0-1.local:30710] [[61064,0],0] reset_usage: ignoring proc [[61064,1],0]
[login-0-1.local:30710] [[61064,0],0] bind_depth: 6 map_depth 0
[login-0-1.local:30710] mca:rmaps: bind downward for job [61064,1] with 
bindings CORE:IF-SUPPORTED
[login-0-1.local:30710] [[61064,0],0] GOT 1 CPUS
[login-0-1.local:30710] [[61064,0],0] GOT 1 CPUS
[login-0-1.local:30710] [[61064,0],0] GOT 1 CPUS
[login-0-1.local:30710] [[61064,0],0] GOT 1 CPUS
[login-0-1.local:30710] [[61064,0],0] PROC [[61064,1],0] BITMAP 0-3,16-19
[login-0-1.local:30710] [[61064,0],0] BOUND PROC [[61064,1],0][c12-29] TO 
socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 
0-1]], socket 0[core 3[hwt 0-1]]: 
[BB/BB/BB/BB/../../../..][../../../../../../../..]
rank 0 @ compute-12-29.local  1, 2, 3, 4, 17, 18, 19, 20,
[login-0-1.local:30710] mca: base: close: component round_robin closed
[login-0-1.local:30710] mca: base: close: unloading component round_robin
[login-0-1.local:30710] mca: base: close: component rank_file closed
[login-0-1.local:30710] mca: base: close: unloading component rank_file
[login-0-1.local:30710] mca: base: close: component seq closed
[login-0-1.local:30710] mca: base: close: unloading component seq
[login-0-1.local:30710] mca: base: close: component resilient closed
[login-0-1.local:30710] mca: base: close: unloading component resilient
[login-0-1.local:30710] mca: base: close: component staged closed
[login-0-1.local:30710] mca: base: close: unloading component staged
[login-0-1.local:30710] mca: base: close: component mindist closed
[login-0-1.local:30710] mca: base: close: unloading component mindist
[login-0-1.local:30710] mca: base: close: component ppr closed
[login-0-1.local:30710] mca: base: close: unloading component ppr





On 10/09/2015 02:07 AM, Ralph Castain wrote:

Hi Marcin

Looking again at this: could you get a similar reservation again and rerun 
mpirun with “-display-allocation” added to the command line? I’d like to see if 
we are correctly parsing the number of slots assigned in the allocation

Ralph

On Oct 6, 2015, at 11:52 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> 
wrote:

Thank you both for your suggestions. I still cannot make this work though, and 
I think - as Ralph predicted - most problems are likely related to the 
non-homogeneous mapping of cpus to jobs. But there are problems even before 
that part.

If I reserve one entire compute node with SLURM:

salloc --ntasks=16 --tasks-per-node=16

I can run my code as you suggested with _any_ N (including odd numbers!). 
OpenMPI will figure out the maximum number of tasks that fits and launch them. 
This also works for many complete nodes, but this is the only case in which I 
managed to get it to work.

If I specify cpus per task, also allocating one full node:

salloc --ntasks=4 --cpus-per-task=4 --tasks-per-node=4

things go astray:

mpirun --map-by slot:pe=4 ./affinity
rank 0 @ compute-1-6.local  0, 1, 2, 3, 16, 17, 18, 19,

Yes, only one MPI process was started. Running what Gilles previously suggested:

$ srun grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list:    0-31
Cpus_allowed_list:    0-31
Cpus_allowed_list:    0-31
Cpus_allowed_list:    0-31

So the allocation seems fine. The SLURM environment is also correct, as far as 
I can tell:

SLURM_CPUS_PER_TASK=4
SLURM_JOB_CPUS_PER_NODE=16
SLURM_JOB_NODELIST=c1-6
SLURM_JOB_NUM_NODES=1
SLURM_NNODES=1
SLURM_NODELIST=c1-6
SLURM_NPROCS=4
SLURM_NTASKS=4
SLURM_NTASKS_PER_NODE=4
SLURM_TASKS_PER_NODE=4
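
As a cross-check of the values listed above, here is a small Python sketch. It
is not part of Open MPI or SLURM, and the helper name expand_per_node is
hypothetical. It assumes the compact per-node syntax that SLURM documents for
variables such as SLURM_TASKS_PER_NODE, where "2(x3),1" means three nodes with
2 each followed by one node with 1.

```python
import re


def expand_per_node(value: str) -> list:
    """Expand SLURM's compact per-node syntax, e.g. '2(x3),1' -> [2, 2, 2, 1]."""
    counts = []
    for item in value.split(","):
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item.strip())
        if match is None:
            raise ValueError(f"unrecognised per-node entry: {item!r}")
        repeat = int(match.group(2) or 1)
        counts.extend([int(match.group(1))] * repeat)
    return counts


# Values copied from the environment listing in this message.
env = {
    "SLURM_CPUS_PER_TASK": "4",
    "SLURM_JOB_CPUS_PER_NODE": "16",
    "SLURM_TASKS_PER_NODE": "4",
}

cpus_per_task = int(env["SLURM_CPUS_PER_TASK"])
tasks = expand_per_node(env["SLURM_TASKS_PER_NODE"])
cpus = expand_per_node(env["SLURM_JOB_CPUS_PER_NODE"])

for node, (ntasks, ncpus) in enumerate(zip(tasks, cpus)):
    needed = ntasks * cpus_per_task
    print(f"node {node}: {ntasks} tasks x {cpus_per_task} cpus = {needed} of {ncpus} cpus")
```

With these values, 4 tasks times 4 cpus per task accounts for all 16 cpus on
the node, which is consistent with the allocation itself being fine.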

I do not understand why openmpi does not want to start more than 1 process. If 
I try to force it (-n 4) I of course get an error:

mpirun --map-by slot:pe=4 -n 4 ./affinity

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
  ./affinity

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------


For clarity, I will not describe other cases / non-contiguous cpu sets / 
heterogeneous nodes. Clearly something is wrong already with the simple ones..

Does anyone have any ideas? Should I record some logs to see what's going on?

Thanks a lot!

Marcin






On 10/06/2015 01:04 AM, tmish...@jcity.maeda.co.jp wrote:

Hi Ralph, it's been a long time.

The option "map-by core" does not work when pe=N > 1 is specified.
So, you should use "map-by slot:pe=N" as far as I remember.

Regards,
Tetsuya Mishima

On 2015/10/06 5:40:33, "users" wrote in "Re: [OMPI users] Hybrid OpenMPI+OpenMP
tasks using SLURM":

Hmmm…okay, try -map-by socket:pe=4

We’ll still hit the asymmetric topology issue, but otherwise this should work

On Oct 5, 2015, at 1:25 PM, marcin.krotkiewski <marcin.krotkiew...@gmail.com>
wrote:

Ralph,

Thank you for the fast response! That sounds very good, but unfortunately I get
an error:

$ mpirun --map-by core:pe=4 ./affinity

--------------------------------------------------------------------------

A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that cannot support that
directive.

Please specify a mapping level that has more than one cpu, or
else let us define a default mapping that will allow multiple
cpus-per-proc.

--------------------------------------------------------------------------

I have allocated my slurm job as

salloc --ntasks=2 --cpus-per-task=4

I have checked in 1.10.0 and 1.10.1rc1.




On 10/05/2015 09:58 PM, Ralph Castain wrote:

You would presently do:

mpirun --map-by core:pe=4

to get what you are seeking. If we don’t already set that qualifier when we see
“cpus_per_task”, then we probably should do so, as there isn’t any reason to
make you set it twice (well, other than trying to track which envar slurm is
using now).

On Oct 5, 2015, at 12:38 PM, marcin.krotkiewski <marcin.krotkiew...@gmail.com>
wrote:

Yet another question about cpu binding under the SLURM environment.

Short version: will OpenMPI support SLURM_CPUS_PER_TASK for the purpose of cpu
binding?

Full version: When you allocate a job like, e.g., this

salloc --ntasks=2 --cpus-per-task=4

SLURM will allocate 8 cores in total, 4 for each 'assumed' MPI task. This is
useful for hybrid jobs, where each MPI process spawns some internal worker
threads (e.g., OpenMP). The intention is that there are 2 MPI procs started,
each of them 'bound' to 4 cores.

SLURM will also set an environment variable

SLURM_CPUS_PER_TASK=4

which should (probably?) be taken into account by the method that launches the
MPI processes to figure out the cpuset. In case of OpenMPI + mpirun I think
something should happen in orte/mca/ras/slurm/ras_slurm_module.c, where the
variable _is_ actually parsed. Unfortunately, it is never really used...

As a result, the cpuset of every task started on a given compute node includes
all CPU cores of all MPI tasks on that node, just as provided by SLURM (in the
above example - 8). In general, there is no simple way for the user code in the
MPI procs to 'split' the cores between themselves. I imagine the original
intention to support this in OpenMPI was something like

mpirun --bind-to subtask_cpuset

with an artificial bind target that would cause OpenMPI to divide the allocated
cores between the MPI tasks. Is this right? If so, it seems that at this point
this is not implemented. Are there plans to do this? If not, does anyone know
another way to achieve that?
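
For illustration, the kind of 'split' described above could be sketched in
user code roughly as follows. This is a hedged Python sketch, not an Open MPI
feature: the helper split_cpus is hypothetical, the script assumes Linux (for
os.sched_getaffinity and os.sched_setaffinity), and it assumes the launcher
exports OMPI_COMM_WORLD_LOCAL_RANK and OMPI_COMM_WORLD_LOCAL_SIZE, as Open
MPI's mpirun does. It divides logical cpu IDs in sorted order and therefore
ignores the hardware topology (for example, hyperthread siblings may have
non-adjacent IDs), which a real solution would have to respect.

```python
import os


def split_cpus(cpus, local_size, local_rank):
    """Return the contiguous share of `cpus` for one of `local_size` local ranks.

    Splits by sorted logical cpu ID only; topology is not considered.
    """
    cpus = sorted(cpus)
    if local_size < 1 or not 0 <= local_rank < local_size:
        raise ValueError("invalid local rank or local size")
    per_rank, extra = divmod(len(cpus), local_size)
    if extra:
        raise ValueError("cpus do not divide evenly between local ranks")
    start = local_rank * per_rank
    return cpus[start:start + per_rank]


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    # Fall back to a single local rank when not started by mpirun.
    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    local_size = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_SIZE", "1"))
    allowed = os.sched_getaffinity(0)   # cpus this process may currently use
    mine = split_cpus(allowed, local_size, local_rank)
    os.sched_setaffinity(0, mine)       # restrict this process to its share
    print(f"local rank {local_rank}: bound to cpus {mine}")
```

For the example above (8 cpus, 2 tasks per node), split_cpus(range(8), 2, 0)
gives cpus 0-3 and split_cpus(range(8), 2, 1) gives cpus 4-7. The binding would
have to happen before any worker threads are created so that they inherit it.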

Thanks a lot!

Marcin



_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:

http://www.open-mpi.org/community/lists/users/2015/10/27803.php

