Ben,
with respect to PBS, are both Open MPI installations built the same way?
e.g. configure --with-tm=/opt/pbs/default or something similar
you can run
mpirun --mca plm_base_verbose 100 --mca ess_base_verbose 100 --mca ras_base_verbose 100 hostname
and you should see the "tm" module in the logs.
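
A quicker cross-check than reading the verbose logs is to list the "tm" components each installation reports through ompi_info. This is only a sketch: the helper name list_tm_components is made up for illustration, the ompi_info line format ("MCA ras: tm (...)") is assumed, and so is the presence of ompi_info next to each mpirun under /apps/openmpi.

```shell
#!/bin/sh
# Print the framework names (e.g. plm, ras, ess) that have a "tm" component,
# reading ompi_info-style output on stdin.
# Assumption: component lines look like "  MCA ras: tm (MCA v2.0.0, ...)".
list_tm_components() {
    grep -E 'MCA +[a-z]+: +tm( |$)' | sed -E 's/^ *MCA +([a-z]+): .*/\1/'
}

# Typical use, once per installation (install prefixes taken from this thread,
# ompi_info location assumed):
#   /apps/openmpi/1.10.0/bin/ompi_info | list_tm_components
#   /apps/openmpi/1.10.2/bin/ompi_info | list_tm_components
```

If one installation prints nothing here, it was most likely configured without --with-tm.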
I noticed you run
mpirun -np 2 ...
is there any reason why you explicitly request 2 tasks?
that is not needed if you submit with qsub -l nodes=1:ppn=2
do you observe the same behavior without -np 2?
by any chance, is hyperthreading enabled on your compute nodes?
/* if yes, that would mean all cores are in the cpuset, but with only one
hardware thread per core */
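
One way to answer the hyperthreading question on a compute node is to count the hardware threads that share a core with cpu0. A minimal sketch, assuming a Linux node that exposes the standard sysfs topology files; count_cpu_list is a helper name made up for this example.

```shell
#!/bin/sh
# Count the entries in a Linux CPU list such as "0,16", "0-1" or "0-15".
count_cpu_list() {
    echo "$1" | tr ',' '\n' | awk -F- '
        NF == 1 { n += 1 }            # single CPU id
        NF == 2 { n += $2 - $1 + 1 }  # inclusive range lo-hi
        END     { print n + 0 }'
}

# Report hardware threads per core for cpu0, if sysfs exposes the topology.
f=/sys/devices/system/cpu/cpu0/topology/thread_siblings_list
if [ -r "$f" ]; then
    echo "threads per core: $(count_cpu_list "$(cat "$f")")"
fi
```

A result above 1 means hyperthreading is enabled, so a cpuset of 16 CPUs may cover only one hardware thread on each core.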
Cheers,
Gilles
On 1/28/2016 7:57 AM, Ben Menadue wrote:
Hi,
Were there any changes to rmaps in going to 1.10.2? An otherwise-identical
setup that worked in 1.10.0 fails to launch in 1.10.2, complaining that
there are no CPUs available in a socket...
With 1.10.0:
$ /apps/openmpi/1.10.0/bin/mpirun -np 2 -mca rmaps_base_verbose 1000 hostname
[r47:18709] mca: base: components_register: registering rmaps components
[r47:18709] mca: base: components_register: found loaded component resilient
[r47:18709] mca: base: components_register: component resilient register function successful
[r47:18709] mca: base: components_register: found loaded component rank_file
[r47:18709] mca: base: components_register: component rank_file register function successful
[r47:18709] mca: base: components_register: found loaded component staged
[r47:18709] mca: base: components_register: component staged has no register or open function
[r47:18709] mca: base: components_register: found loaded component ppr
[r47:18709] mca: base: components_register: component ppr register function successful
[r47:18709] mca: base: components_register: found loaded component seq
[r47:18709] mca: base: components_register: component seq register function successful
[r47:18709] mca: base: components_register: found loaded component round_robin
[r47:18709] mca: base: components_register: component round_robin register function successful
[r47:18709] mca: base: components_register: found loaded component mindist
[r47:18709] mca: base: components_register: component mindist register function successful
[r47:18709] [[63529,0],0] rmaps:base set policy with core
[r47:18709] mca: base: components_open: opening rmaps components
[r47:18709] mca: base: components_open: found loaded component resilient
[r47:18709] mca: base: components_open: component resilient open function successful
[r47:18709] mca: base: components_open: found loaded component rank_file
[r47:18709] mca: base: components_open: component rank_file open function successful
[r47:18709] mca: base: components_open: found loaded component staged
[r47:18709] mca: base: components_open: component staged open function successful
[r47:18709] mca: base: components_open: found loaded component ppr
[r47:18709] mca: base: components_open: component ppr open function successful
[r47:18709] mca: base: components_open: found loaded component seq
[r47:18709] mca: base: components_open: component seq open function successful
[r47:18709] mca: base: components_open: found loaded component round_robin
[r47:18709] mca: base: components_open: component round_robin open function successful
[r47:18709] mca: base: components_open: found loaded component mindist
[r47:18709] mca: base: components_open: component mindist open function successful
[r47:18709] mca:rmaps:select: checking available component resilient
[r47:18709] mca:rmaps:select: Querying component [resilient]
[r47:18709] mca:rmaps:select: checking available component rank_file
[r47:18709] mca:rmaps:select: Querying component [rank_file]
[r47:18709] mca:rmaps:select: checking available component staged
[r47:18709] mca:rmaps:select: Querying component [staged]
[r47:18709] mca:rmaps:select: checking available component ppr
[r47:18709] mca:rmaps:select: Querying component [ppr]
[r47:18709] mca:rmaps:select: checking available component seq
[r47:18709] mca:rmaps:select: Querying component [seq]
[r47:18709] mca:rmaps:select: checking available component round_robin
[r47:18709] mca:rmaps:select: Querying component [round_robin]
[r47:18709] mca:rmaps:select: checking available component mindist
[r47:18709] mca:rmaps:select: Querying component [mindist]
[r47:18709] [[63529,0],0]: Final mapper priorities
[r47:18709] Mapper: ppr Priority: 90
[r47:18709] Mapper: seq Priority: 60
[r47:18709] Mapper: resilient Priority: 40
[r47:18709] Mapper: mindist Priority: 20
[r47:18709] Mapper: round_robin Priority: 10
[r47:18709] Mapper: staged Priority: 5
[r47:18709] Mapper: rank_file Priority: 0
[r47:18709] mca:rmaps: mapping job [63529,1]
[r47:18709] mca:rmaps: creating new map for job [63529,1]
[r47:18709] mca:rmaps: nprocs 2
[r47:18709] mca:rmaps mapping given - using default
[r47:18709] mca:rmaps:ppr: job [63529,1] not using ppr mapper
[r47:18709] mca:rmaps:seq: job [63529,1] not using seq mapper
[r47:18709] mca:rmaps:resilient: cannot perform initial map of job [63529,1] - no fault groups
[r47:18709] mca:rmaps:mindist: job [63529,1] not using mindist mapper
[r47:18709] mca:rmaps:rr: mapping job [63529,1]
[r47:18709] AVAILABLE NODES FOR MAPPING:
[r47:18709] node: r47 daemon: 0
[r47:18709] node: r57 daemon: 1
[r47:18709] node: r58 daemon: 2
[r47:18709] node: r59 daemon: 3
[r47:18709] mca:rmaps:rr: mapping no-span by Core for job [63529,1] slots 64 num_procs 2
[r47:18709] mca:rmaps:rr: found 16 Core objects on node r47
[r47:18709] mca:rmaps:rr: assigning proc to object 0
[r47:18709] mca:rmaps:rr: assigning proc to object 1
[r47:18709] mca:rmaps: computing ranks by core for job [63529,1]
[r47:18709] mca:rmaps:rank_by: found 16 objects on node r47 with 2 procs
[r47:18709] mca:rmaps:rank_by: assigned rank 0
[r47:18709] mca:rmaps:rank_by: assigned rank 1
[r47:18709] mca:rmaps:rank_by: found 16 objects on node r57 with 0 procs
[r47:18709] mca:rmaps:rank_by: found 16 objects on node r58 with 0 procs
[r47:18709] mca:rmaps:rank_by: found 16 objects on node r59 with 0 procs
[r47:18709] mca:rmaps: compute bindings for job [63529,1] with policy CORE[4008]
[r47:18709] mca:rmaps: bindings for job [63529,1] - bind in place
[r47:18709] mca:rmaps: bind in place for job [63529,1] with bindings CORE
[r47:18709] [[63529,0],0] reset_usage: node r47 has 2 procs on it
[r47:18709] [[63529,0],0] reset_usage: ignoring proc [[63529,1],0]
[r47:18709] [[63529,0],0] reset_usage: ignoring proc [[63529,1],1]
[r47:18709] BINDING PROC [[63529,1],0] TO Core NUMBER 0
[r47:18709] [[63529,0],0] BOUND PROC [[63529,1],0] TO 0[Core:0] on node r47
[r47:18709] BINDING PROC [[63529,1],1] TO Core NUMBER 1
[r47:18709] [[63529,0],0] BOUND PROC [[63529,1],1] TO 1[Core:1] on node r47
r47
r47
[r47:18709] mca: base: close: component resilient closed
[r47:18709] mca: base: close: unloading component resilient
[r47:18709] mca: base: close: component rank_file closed
[r47:18709] mca: base: close: unloading component rank_file
[r47:18709] mca: base: close: component staged closed
[r47:18709] mca: base: close: unloading component staged
[r47:18709] mca: base: close: component ppr closed
[r47:18709] mca: base: close: unloading component ppr
[r47:18709] mca: base: close: component seq closed
[r47:18709] mca: base: close: unloading component seq
[r47:18709] mca: base: close: component round_robin closed
[r47:18709] mca: base: close: unloading component round_robin
[r47:18709] mca: base: close: component mindist closed
[r47:18709] mca: base: close: unloading component mindist
With 1.10.2:
$ /apps/openmpi/1.10.2/bin/mpirun -np 2 -mca rmaps_base_verbose 1000 hostname
[r47:18733] mca: base: components_register: registering rmaps components
[r47:18733] mca: base: components_register: found loaded component resilient
[r47:18733] mca: base: components_register: component resilient register function successful
[r47:18733] mca: base: components_register: found loaded component rank_file
[r47:18733] mca: base: components_register: component rank_file register function successful
[r47:18733] mca: base: components_register: found loaded component staged
[r47:18733] mca: base: components_register: component staged has no register or open function
[r47:18733] mca: base: components_register: found loaded component ppr
[r47:18733] mca: base: components_register: component ppr register function successful
[r47:18733] mca: base: components_register: found loaded component seq
[r47:18733] mca: base: components_register: component seq register function successful
[r47:18733] mca: base: components_register: found loaded component round_robin
[r47:18733] mca: base: components_register: component round_robin register function successful
[r47:18733] mca: base: components_register: found loaded component mindist
[r47:18733] mca: base: components_register: component mindist register function successful
[r47:18733] [[63505,0],0] rmaps:base set policy with core
[r47:18733] mca: base: components_open: opening rmaps components
[r47:18733] mca: base: components_open: found loaded component resilient
[r47:18733] mca: base: components_open: component resilient open function successful
[r47:18733] mca: base: components_open: found loaded component rank_file
[r47:18733] mca: base: components_open: component rank_file open function successful
[r47:18733] mca: base: components_open: found loaded component staged
[r47:18733] mca: base: components_open: component staged open function successful
[r47:18733] mca: base: components_open: found loaded component ppr
[r47:18733] mca: base: components_open: component ppr open function successful
[r47:18733] mca: base: components_open: found loaded component seq
[r47:18733] mca: base: components_open: component seq open function successful
[r47:18733] mca: base: components_open: found loaded component round_robin
[r47:18733] mca: base: components_open: component round_robin open function successful
[r47:18733] mca: base: components_open: found loaded component mindist
[r47:18733] mca: base: components_open: component mindist open function successful
[r47:18733] mca:rmaps:select: checking available component resilient
[r47:18733] mca:rmaps:select: Querying component [resilient]
[r47:18733] mca:rmaps:select: checking available component rank_file
[r47:18733] mca:rmaps:select: Querying component [rank_file]
[r47:18733] mca:rmaps:select: checking available component staged
[r47:18733] mca:rmaps:select: Querying component [staged]
[r47:18733] mca:rmaps:select: checking available component ppr
[r47:18733] mca:rmaps:select: Querying component [ppr]
[r47:18733] mca:rmaps:select: checking available component seq
[r47:18733] mca:rmaps:select: Querying component [seq]
[r47:18733] mca:rmaps:select: checking available component round_robin
[r47:18733] mca:rmaps:select: Querying component [round_robin]
[r47:18733] mca:rmaps:select: checking available component mindist
[r47:18733] mca:rmaps:select: Querying component [mindist]
[r47:18733] [[63505,0],0]: Final mapper priorities
[r47:18733] Mapper: ppr Priority: 90
[r47:18733] Mapper: seq Priority: 60
[r47:18733] Mapper: resilient Priority: 40
[r47:18733] Mapper: mindist Priority: 20
[r47:18733] Mapper: round_robin Priority: 10
[r47:18733] Mapper: staged Priority: 5
[r47:18733] Mapper: rank_file Priority: 0
[r47:18733] mca:rmaps: mapping job [63505,1]
[r47:18733] mca:rmaps: creating new map for job [63505,1]
[r47:18733] mca:rmaps: nprocs 2
[r47:18733] mca:rmaps mapping given - using default
[r47:18733] mca:rmaps:ppr: job [63505,1] not using ppr mapper
[r47:18733] mca:rmaps:seq: job [63505,1] not using seq mapper
[r47:18733] mca:rmaps:resilient: cannot perform initial map of job [63505,1] - no fault groups
[r47:18733] mca:rmaps:mindist: job [63505,1] not using mindist mapper
[r47:18733] mca:rmaps:rr: mapping job [63505,1]
[r47:18733] AVAILABLE NODES FOR MAPPING:
[r47:18733] node: r47 daemon: 0
[r47:18733] node: r57 daemon: 1
[r47:18733] node: r58 daemon: 2
[r47:18733] node: r59 daemon: 3
[r47:18733] mca:rmaps:rr: mapping no-span by Core for job [63505,1] slots 64 num_procs 2
[r47:18733] mca:rmaps:rr: found 16 Core objects on node r47
[r47:18733] mca:rmaps:rr: assigning proc to object 0
--------------------------------------------------------------------------
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that has less cpus than
requested ones:
#cpus-per-proc: 1
number of cpus: 0
map-by: BYCORE:NOOVERSUBSCRIBE
Please specify a mapping level that has more cpus, or else let us
define a default mapping that will allow multiple cpus-per-proc.
--------------------------------------------------------------------------
[r47:18733] mca: base: close: component resilient closed
[r47:18733] mca: base: close: unloading component resilient
[r47:18733] mca: base: close: component rank_file closed
[r47:18733] mca: base: close: unloading component rank_file
[r47:18733] mca: base: close: component staged closed
[r47:18733] mca: base: close: unloading component staged
[r47:18733] mca: base: close: component ppr closed
[r47:18733] mca: base: close: unloading component ppr
[r47:18733] mca: base: close: component seq closed
[r47:18733] mca: base: close: unloading component seq
[r47:18733] mca: base: close: component round_robin closed
[r47:18733] mca: base: close: unloading component round_robin
[r47:18733] mca: base: close: component mindist closed
[r47:18733] mca: base: close: unloading component mindist
These are both in the same PBS Pro job, and the cpuset definitely has all
cores available:
$ cat /cgroup/cpuset/pbspro/4347646.r-man2/cpuset.cpus
0-15
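
The cpuset file shows what the job was granted, not necessarily what a launched process is allowed to run on. A sketch of how the two could be compared from inside the job, assuming a Linux /proc filesystem; allowed_cpus is a hypothetical helper name, and the cpuset path is the one quoted above (the job ID changes per job).

```shell
#!/bin/sh
# Print the Cpus_allowed_list field from a /proc/<pid>/status-style file.
# With no argument it reads /proc/self/status, i.e. the awk child process,
# which inherits its CPU affinity from the calling shell.
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "${1:-/proc/self/status}"
}

# Compare the shell's affinity with the job's cpuset, if the file is readable.
cpuset=/cgroup/cpuset/pbspro/4347646.r-man2/cpuset.cpus
if [ -r "$cpuset" ]; then
    echo "cpuset:  $(cat "$cpuset")"
    echo "allowed: $(allowed_cpus)"
fi
```

If the two lists differ, something between PBS and the shell has narrowed the affinity mask.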
Is there something here I'm missing?
Cheers,
Ben
Link to this post:
http://www.open-mpi.org/community/lists/users/2016/01/28393.php