Yes, I'm able to reproduce it on a single node as well.

Actually, even with just a single CPU (and -np 1), it won't let me launch
unless both threads of that core are in the cgroup.
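
For reference, the single-CPU case is roughly the following (the cgroup path
is the PBS one on our nodes, the job ID is elided, and PU 16 is the
hyperthread sibling of core 0 on this layout):

  # only one hardware thread of core 0 in the cpuset: mpirun refuses to launch
  echo 0 > /cgroup/cpuset/pbspro/<jobid>/cpuset.cpus
  mpirun -np 1 hostname

  # with the thread sibling included as well, it launches fine
  echo 0,16 > /cgroup/cpuset/pbspro/<jobid>/cpuset.cpus
  mpirun -np 1 hostname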


-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gilles
Gouaillardet
Sent: Friday, 29 January 2016 1:33 PM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Any changes to rmaps in 1.10.2?

I was able to reproduce the issue on one node with a cpuset manually set.

FWIW, I cannot reproduce the issue using taskset instead of cpuset (!)
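
For the record, roughly what I did (paths and cpu ids are only an example,
assuming the cpuset controller is mounted at /cgroup/cpuset):

  # a cpuset cgroup with one hardware thread per core: reproduces the error
  mkdir /cgroup/cpuset/test
  echo 0-15 > /cgroup/cpuset/test/cpuset.cpus
  echo 0 > /cgroup/cpuset/test/cpuset.mems
  echo $$ > /cgroup/cpuset/test/tasks
  mpirun -np 2 hostname

  # taskset only changes the affinity mask, not the cgroup: does not reproduce
  taskset -c 0-15 mpirun -np 2 hostname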

Cheers,

Gilles

On 1/29/2016 11:08 AM, Ben Menadue wrote:
> Hi Gilles, Ralph,
>
> Okay, it definitely seems to be due to the cpuset having only one of 
> the hyperthreads of each physical core:
>
>
> [13:02:13 root@r60:4363542.r-man2] # echo 0-15 > cpuset.cpus
>
> 13:03 bjm900@r60 ~ > cat /cgroup/cpuset/pbspro/4363542.r-man2/cpuset.cpus
> 0-15
>
> 13:03 bjm900@r60 ~ > /apps/openmpi/1.10.2/bin/mpirun  hostname
> --------------------------------------------------------------------------
> A request for multiple cpus-per-proc was given, but a directive
> was also give to map to an object level that has less cpus than
> requested ones:
>
>    #cpus-per-proc:  1
>    number of cpus:  0
>    map-by:          BYCORE:NOOVERSUBSCRIBE
>
> Please specify a mapping level that has more cpus, or else let us 
> define a default mapping that will allow multiple cpus-per-proc.
> --------------------------------------------------------------------------
>
> [13:03:43 root@r60:4363542.r-man2] # echo 0-31 > cpuset.cpus
>
> 13:03 bjm900@r60 ~ > cat /cgroup/cpuset/pbspro/4363542.r-man2/cpuset.cpus
> 0-31
>
> 13:04 bjm900@r60 ~ > /apps/openmpi/1.10.2/bin/mpirun  hostname 
> <...hostnames...>
>
>
> Cheers,
> Ben
>
>
>
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ben 
> Menadue
> Sent: Friday, 29 January 2016 1:01 PM
> To: 'Open MPI Users' <us...@open-mpi.org>
> Subject: Re: [OMPI users] Any changes to rmaps in 1.10.2?
>
> Hi Gilles,
>
>> with respect to PBS, are both OpenMPI built the same way ?
>> e.g. configure --with-tm=/opt/pbs/default or something similar
> Both are built against TM explicitly using the --with-tm option.
>
>> you can run
>> mpirun --mca plm_base_verbose 100 --mca ess_base_verbose 100 --mca
>> ras_base_verbose 100 hostname
>> and you should see the "tm" module in the logs.
> Yes, it appears to use TM from what I can see. Outputs from 1.10.0 and
> 1.10.2 are attached from inside the same job - they look identical
> (apart from the pids), except at the very end where 1.10.2 errors out
> while 1.10.0 continues.
>
>> i noticed you run
>> mpirun -np 2 ...
>> is there any reason why you explicitly request 2 tasks ?
> The "-np 2" is because that's what I was using to benchmark the 
> install
> (osu_bibw) and I just copied it over from when I realised it wasn't 
> even starting. But it does the same regardless of whether I specify 
> the number of processes or not (without it gets the number of tasks from
PBS).
>
>> by any chance, is hyperthreading enabled on your compute node ?
>> /* if yes, that means all cores are in the cpuset, but with only one
>> thread per core */
>
> The nodes are 2 x 8-core sockets with hyper-threading on, and you can
> choose whether to use the extra hardware threads when submitting the
> job. If you want them, your cgroup includes both threads on each core
> (e.g. 0-31), otherwise only one thread per core (e.g. 0-15); cores
> 16-31 are the thread siblings of cores 0-15.
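>
> (If it helps, that layout can be confirmed with something like
>
>    $ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
>    0,16
>
> or with hwloc's lstopo, which shows each core with its two PUs.)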
>
> For reference, the PBS job I was using above had ncpus=32,mem=16G, 
> which becomes
>
>    select=2:ncpus=16:mpiprocs=16:mem=8589934592b
>
> under the hood with a cpuset containing cores 0-15 on each of the two
> nodes.
>
> Interestingly, if I use a cpuset containing both threads of each 
> physical core (i.e. ask for hyperthreading on job submission), then it 
> runs fine under 1.10.2.
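>
> (A possible workaround, which I have not tried here, might be to tell
> mpirun to treat the hardware threads as independent cpus, e.g.
>
>    mpirun --use-hwthread-cpus -np 2 ./osu_bibw
>
> or to map and bind by hwthread rather than by core.)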
>
> Cheers,
> Ben
>
>
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gilles 
> Gouaillardet
> Sent: Friday, 29 January 2016 11:07 AM
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] Any changes to rmaps in 1.10.2?
>
> Ben,
>
>
>
> That is not needed if you submit with qsub -l nodes=1:ppn=2. Do you
> observe the same behavior without -np 2?
>
>
> Cheers,
>
> Gilles
>
> On 1/28/2016 7:57 AM, Ben Menadue wrote:
>> Hi,
>>
>> Were there any changes to rmaps in going to 1.10.2? An
>> otherwise-identical setup that worked in 1.10.0 fails to launch in
>> 1.10.2, complaining that there are no CPUs available in a socket...
>>
>> With 1.10.0:
>>
>> $ /apps/openmpi/1.10.0/bin/mpirun -np 2 -mca rmaps_base_verbose 1000 
>> hostname [r47:18709] mca: base: components_register: registering 
>> rmaps components [r47:18709] mca: base: components_register: found 
>> loaded component resilient [r47:18709] mca: base: components_register:
>> component resilient register function successful [r47:18709] mca:
>> base: components_register: found loaded component rank_file 
>> [r47:18709] mca: base: components_register: component rank_file 
>> register function successful [r47:18709] mca: base:
>> components_register: found loaded component staged [r47:18709] mca:
>> base: components_register: component staged has no register or open 
>> function [r47:18709] mca: base: components_register: found loaded 
>> component ppr [r47:18709] mca: base: components_register: component 
>> ppr register function successful [r47:18709] mca: base:
>> components_register: found loaded component seq [r47:18709] mca: base:
>> components_register: component seq register function successful 
>> [r47:18709] mca: base: components_register: found loaded component 
>> round_robin [r47:18709] mca: base: components_register: component 
>> round_robin register function successful [r47:18709] mca: base:
>> components_register: found loaded component mindist [r47:18709] mca:
>> base: components_register: component mindist register function 
>> successful [r47:18709] [[63529,0],0] rmaps:base set policy with core 
>> [r47:18709] mca: base: components_open: opening rmaps components 
>> [r47:18709] mca: base: components_open: found loaded component 
>> resilient [r47:18709] mca: base: components_open: component resilient 
>> open function successful [r47:18709] mca: base: components_open: 
>> found loaded component rank_file [r47:18709] mca: base: components_open:
>> component rank_file open function successful [r47:18709] mca: base:
>> components_open: found loaded component staged [r47:18709] mca: base:
>> components_open: component staged open function successful 
>> [r47:18709]
>> mca: base: components_open: found loaded component ppr [r47:18709]
>> mca: base: components_open: component ppr open function successful 
>> [r47:18709] mca: base: components_open: found loaded component seq 
>> [r47:18709] mca: base: components_open: component seq open function 
>> successful [r47:18709] mca: base: components_open: found loaded 
>> component round_robin [r47:18709] mca: base: components_open:
>> component round_robin open function successful [r47:18709] mca: base:
>> components_open: found loaded component mindist [r47:18709] mca: base:
>> components_open: component mindist open function successful 
>> [r47:18709] mca:rmaps:select: checking available component resilient 
>> [r47:18709] mca:rmaps:select: Querying component [resilient] 
>> [r47:18709] mca:rmaps:select: checking available component rank_file 
>> [r47:18709] mca:rmaps:select: Querying component [rank_file] 
>> [r47:18709] mca:rmaps:select: checking available component staged 
>> [r47:18709] mca:rmaps:select: Querying component [staged] [r47:18709]
>> mca:rmaps:select: checking available component ppr [r47:18709]
>> mca:rmaps:select: Querying component [ppr] [r47:18709]
>> mca:rmaps:select: checking available component seq [r47:18709]
>> mca:rmaps:select: Querying component [seq] [r47:18709]
>> mca:rmaps:select: checking available component round_robin 
>> [r47:18709]
>> mca:rmaps:select: Querying component [round_robin] [r47:18709]
>> mca:rmaps:select: checking available component mindist [r47:18709]
>> mca:rmaps:select: Querying component [mindist] [r47:18709]
>> [[63529,0],0]: Final mapper priorities
>> [r47:18709]  Mapper: ppr Priority: 90
>> [r47:18709]  Mapper: seq Priority: 60
>> [r47:18709]  Mapper: resilient Priority: 40
>> [r47:18709]  Mapper: mindist Priority: 20
>> [r47:18709]  Mapper: round_robin Priority: 10
>> [r47:18709]  Mapper: staged Priority: 5
>> [r47:18709]  Mapper: rank_file Priority: 0
>> [r47:18709] mca:rmaps: mapping job [63529,1] [r47:18709] mca:rmaps:
>> creating new map for job [63529,1] [r47:18709] mca:rmaps: nprocs 2 
>> [r47:18709] mca:rmaps mapping given - using default [r47:18709]
>> mca:rmaps:ppr: job [63529,1] not using ppr mapper [r47:18709]
>> mca:rmaps:seq: job [63529,1] not using seq mapper [r47:18709]
>> mca:rmaps:resilient: cannot perform initial map of job [63529,1]
>> - no fault groups
>> [r47:18709] mca:rmaps:mindist: job [63529,1] not using mindist mapper 
>> [r47:18709] mca:rmaps:rr: mapping job [63529,1] [r47:18709] AVAILABLE 
>> NODES FOR MAPPING:
>> [r47:18709]     node: r47 daemon: 0
>> [r47:18709]     node: r57 daemon: 1
>> [r47:18709]     node: r58 daemon: 2
>> [r47:18709]     node: r59 daemon: 3
>> [r47:18709] mca:rmaps:rr: mapping no-span by Core for job [63529,1] 
>> slots 64 num_procs 2 [r47:18709] mca:rmaps:rr: found 16 Core objects 
>> on node r47 [r47:18709] mca:rmaps:rr: assigning proc to object 0 
>> [r47:18709] mca:rmaps:rr: assigning proc to object 1 [r47:18709]
>> mca:rmaps: computing ranks by core for job [63529,1] [r47:18709]
>> mca:rmaps:rank_by: found 16 objects on node r47 with 2 procs 
>> [r47:18709] mca:rmaps:rank_by: assigned rank 0 [r47:18709]
>> mca:rmaps:rank_by: assigned rank 1 [r47:18709] mca:rmaps:rank_by:
>> found 16 objects on node r57 with 0 procs [r47:18709]
>> mca:rmaps:rank_by: found 16 objects on node r58 with 0 procs 
>> [r47:18709] mca:rmaps:rank_by: found 16 objects on node r59 with 0 
>> procs [r47:18709] mca:rmaps: compute bindings for job [63529,1] with 
>> policy CORE[4008] [r47:18709] mca:rmaps: bindings for job [63529,1] - 
>> bind in place [r47:18709] mca:rmaps: bind in place for job [63529,1] 
>> with bindings CORE [r47:18709] [[63529,0],0] reset_usage: node r47 
>> has
>> 2 procs on it [r47:18709] [[63529,0],0] reset_usage: ignoring proc 
>> [[63529,1],0] [r47:18709] [[63529,0],0] reset_usage: ignoring proc 
>> [[63529,1],1] [r47:18709] BINDING PROC [[63529,1],0] TO Core NUMBER 0 
>> [r47:18709] [[63529,0],0] BOUND PROC [[63529,1],0] TO 0[Core:0] on 
>> node r47 [r47:18709] BINDING PROC [[63529,1],1] TO Core NUMBER 1 
>> [r47:18709] [[63529,0],0] BOUND PROC [[63529,1],1] TO 1[Core:1] on 
>> node r47
>> r47
>> r47
>> [r47:18709] mca: base: close: component resilient closed [r47:18709]
>> mca: base: close: unloading component resilient [r47:18709] mca: base:
>> close: component rank_file closed [r47:18709] mca: base: close:
>> unloading component rank_file [r47:18709] mca: base: close: component 
>> staged closed [r47:18709] mca: base: close: unloading component 
>> staged [r47:18709] mca: base: close: component ppr closed [r47:18709]
>> mca:
>> base: close: unloading component ppr [r47:18709] mca: base: close:
>> component seq closed [r47:18709] mca: base: close: unloading 
>> component seq [r47:18709] mca: base: close: component round_robin 
>> closed [r47:18709] mca: base: close: unloading component round_robin 
>> [r47:18709] mca: base: close: component mindist closed [r47:18709]
>> mca: base: close: unloading component mindist
>>
>> With 1.10.2:
>>
>> $ /apps/openmpi/1.10.2/bin/mpirun -np 2 -mca rmaps_base_verbose 1000 
>> hostname [r47:18733] mca: base: components_register: registering 
>> rmaps components [r47:18733] mca: base: components_register: found 
>> loaded component resilient [r47:18733] mca: base: components_register:
>> component resilient register function successful [r47:18733] mca:
>> base: components_register: found loaded component rank_file 
>> [r47:18733] mca: base: components_register: component rank_file 
>> register function successful [r47:18733] mca: base:
>> components_register: found loaded component staged [r47:18733] mca:
>> base: components_register: component staged has no register or open 
>> function [r47:18733] mca: base: components_register: found loaded 
>> component ppr [r47:18733] mca: base: components_register: component 
>> ppr register function successful [r47:18733] mca: base:
>> components_register: found loaded component seq [r47:18733] mca: base:
>> components_register: component seq register function successful 
>> [r47:18733] mca: base: components_register: found loaded component 
>> round_robin [r47:18733] mca: base: components_register: component 
>> round_robin register function successful [r47:18733] mca: base:
>> components_register: found loaded component mindist [r47:18733] mca:
>> base: components_register: component mindist register function 
>> successful [r47:18733] [[63505,0],0] rmaps:base set policy with core 
>> [r47:18733] mca: base: components_open: opening rmaps components 
>> [r47:18733] mca: base: components_open: found loaded component 
>> resilient [r47:18733] mca: base: components_open: component resilient 
>> open function successful [r47:18733] mca: base: components_open: 
>> found loaded component rank_file [r47:18733] mca: base: components_open:
>> component rank_file open function successful [r47:18733] mca: base:
>> components_open: found loaded component staged [r47:18733] mca: base:
>> components_open: component staged open function successful 
>> [r47:18733]
>> mca: base: components_open: found loaded component ppr [r47:18733]
>> mca: base: components_open: component ppr open function successful 
>> [r47:18733] mca: base: components_open: found loaded component seq 
>> [r47:18733] mca: base: components_open: component seq open function 
>> successful [r47:18733] mca: base: components_open: found loaded 
>> component round_robin [r47:18733] mca: base: components_open:
>> component round_robin open function successful [r47:18733] mca: base:
>> components_open: found loaded component mindist [r47:18733] mca: base:
>> components_open: component mindist open function successful 
>> [r47:18733] mca:rmaps:select: checking available component resilient 
>> [r47:18733] mca:rmaps:select: Querying component [resilient] 
>> [r47:18733] mca:rmaps:select: checking available component rank_file 
>> [r47:18733] mca:rmaps:select: Querying component [rank_file] 
>> [r47:18733] mca:rmaps:select: checking available component staged 
>> [r47:18733] mca:rmaps:select: Querying component [staged] [r47:18733]
>> mca:rmaps:select: checking available component ppr [r47:18733]
>> mca:rmaps:select: Querying component [ppr] [r47:18733]
>> mca:rmaps:select: checking available component seq [r47:18733]
>> mca:rmaps:select: Querying component [seq] [r47:18733]
>> mca:rmaps:select: checking available component round_robin 
>> [r47:18733]
>> mca:rmaps:select: Querying component [round_robin] [r47:18733]
>> mca:rmaps:select: checking available component mindist [r47:18733]
>> mca:rmaps:select: Querying component [mindist] [r47:18733]
>> [[63505,0],0]: Final mapper priorities
>> [r47:18733]  Mapper: ppr Priority: 90
>> [r47:18733]  Mapper: seq Priority: 60
>> [r47:18733]  Mapper: resilient Priority: 40
>> [r47:18733]  Mapper: mindist Priority: 20
>> [r47:18733]  Mapper: round_robin Priority: 10
>> [r47:18733]  Mapper: staged Priority: 5
>> [r47:18733]  Mapper: rank_file Priority: 0
>> [r47:18733] mca:rmaps: mapping job [63505,1] [r47:18733] mca:rmaps:
>> creating new map for job [63505,1] [r47:18733] mca:rmaps: nprocs 2 
>> [r47:18733] mca:rmaps mapping given - using default [r47:18733]
>> mca:rmaps:ppr: job [63505,1] not using ppr mapper [r47:18733]
>> mca:rmaps:seq: job [63505,1] not using seq mapper [r47:18733]
>> mca:rmaps:resilient: cannot perform initial map of job [63505,1]
>> - no fault groups
>> [r47:18733] mca:rmaps:mindist: job [63505,1] not using mindist mapper 
>> [r47:18733] mca:rmaps:rr: mapping job [63505,1] [r47:18733] AVAILABLE 
>> NODES FOR MAPPING:
>> [r47:18733]     node: r47 daemon: 0
>> [r47:18733]     node: r57 daemon: 1
>> [r47:18733]     node: r58 daemon: 2
>> [r47:18733]     node: r59 daemon: 3
>> [r47:18733] mca:rmaps:rr: mapping no-span by Core for job [63505,1] 
>> slots 64 num_procs 2 [r47:18733] mca:rmaps:rr: found 16 Core objects 
>> on node r47 [r47:18733] mca:rmaps:rr: assigning proc to object 0
>> --------------------------------------------------------------------------
>> A request for multiple cpus-per-proc was given, but a directive
>> was also give to map to an object level that has less cpus than
>> requested ones:
>>
>>     #cpus-per-proc:  1
>>     number of cpus:  0
>>     map-by:          BYCORE:NOOVERSUBSCRIBE
>>
>> Please specify a mapping level that has more cpus, or else let us 
>> define a default mapping that will allow multiple cpus-per-proc.
>> --------------------------------------------------------------------------
>> [r47:18733] mca: base: close: component resilient closed
>> [r47:18733] mca: base: close: unloading component resilient 
>> [r47:18733] mca: base: close: component rank_file closed [r47:18733]
>> mca: base: close: unloading component rank_file [r47:18733] mca: base:
>> close: component staged closed [r47:18733] mca: base: close: 
>> unloading component staged [r47:18733] mca: base: close: component 
>> ppr closed [r47:18733] mca: base: close: unloading component ppr
>> [r47:18733] mca:
>> base: close: component seq closed [r47:18733] mca: base: close:
>> unloading component seq [r47:18733] mca: base: close: component 
>> round_robin closed [r47:18733] mca: base: close: unloading component 
>> round_robin [r47:18733] mca: base: close: component mindist closed 
>> [r47:18733] mca: base: close: unloading component mindist
>>
>> These are both in the same PBS Pro job. And the cpuset definitely has
>> all cores available:
>>
>> $ cat /cgroup/cpuset/pbspro/4347646.r-man2/cpuset.cpus
>> 0-15
>>
>> Is there something here I'm missing?
>>
>> Cheers,
>> Ben
>>
>>
>>
>
>

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2016/01/28404.php
