Ben,

What is the minimum number of nodes required to reproduce the issue?
E.g., can you reproduce it with one node?

Cheers,

Gilles

On 1/29/2016 11:00 AM, Ben Menadue wrote:
Hi Gilles,

> With respect to PBS, are both Open MPI installs built the same way?
> e.g. configure --with-tm=/opt/pbs/default or something similar
Both are built against TM explicitly using the --with-tm option.

> You can run
>     mpirun --mca plm_base_verbose 100 --mca ess_base_verbose 100 --mca ras_base_verbose 100 hostname
> and you should see the "tm" module in the logs.
Yes, it appears to use TM from what I can see. Outputs from 1.10.0 and
1.10.2 are attached, captured inside the same job - they look identical (apart
from the PIDs), except at the very end, where 1.10.2 errors out while 1.10.0
continues.

> I noticed you run
>     mpirun -np 2 ...
> is there any reason why you explicitly request 2 tasks?
The "-np 2" is because that's what I was using to benchmark the install
(osu_bibw) and I just copied it over from when I realised it wasn't even
starting. But it does the same regardless of whether I specify the number of
processes or not (without it gets the number of tasks from PBS).
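
(As a sanity check on the slot count mpirun should be inheriting from PBS, I'd
expect it to match the nodefile, e.g. something like

    $ wc -l < $PBS_NODEFILE    # one line per MPI slot PBS allocated to the job

run inside the job, assuming the usual PBS_NODEFILE environment is set up.)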

> By any chance, is hyperthreading enabled on your compute node?
> /* if yes, that means all cores are in the cpuset, but with only one
> thread per core */

The nodes are 2 x 8-core sockets with hyper-threading on, and you can choose
whether to use the extra hardware threads when submitting the job. If you
want them, your cgroup includes both threads on each core (e.g. 0-31);
otherwise it includes only one thread per core (e.g. 0-15, with cores 16-31
being the thread siblings of cores 0-15).
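
(For reference, this can be checked from inside a job; a quick sketch, with
<jobid> standing in for the actual PBS job ID:

    $ cat /cgroup/cpuset/pbspro/<jobid>/cpuset.cpus                     # CPUs in the job's cpuset
    $ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list    # hardware-thread siblings of core 0

With hyper-threading requested, the first file shows 0-31; without it, 0-15.)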

For reference, the PBS job I was using above had ncpus=32,mem=16G, which
becomes

   select=2:ncpus=16:mpiprocs=16:mem=8589934592b

under the hood with a cpuset containing cores 0-15 on each of the two nodes.

Interestingly, if I use a cpuset containing both threads of each physical
core (i.e. ask for hyperthreading on job submission), then it runs fine
under 1.10.2.
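
If it would help narrow this down, I can also compare what hwloc reports inside
the two kinds of cpusets; a minimal sketch, assuming hwloc's command-line tools
are installed on the compute nodes:

    $ hwloc-ls    # run inside the job; lists the sockets/cores/PUs visible within the cgroup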

Cheers,
Ben


-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gilles
Gouaillardet
Sent: Friday, 29 January 2016 11:07 AM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Any changes to rmaps in 1.10.2?

Ben,



That is not needed if you submit with qsub -l nodes=1:ppn=2. Do you observe
the same behavior without -np 2?
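
For example, something along these lines should be enough to test on a single
node (adjust the resource request to your site's syntax):

    $ qsub -I -l select=1:ncpus=16:mpiprocs=16    # interactive job on one node
    $ /apps/openmpi/1.10.2/bin/mpirun hostname    # no -np, let the TM/RAS modules supply the slots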


Cheers,

Gilles

On 1/28/2016 7:57 AM, Ben Menadue wrote:
Hi,

Were there any changes to rmaps in going to 1.10.2? An
otherwise-identical setup that worked in 1.10.0 fails to launch in
1.10.2, complaining that there are no CPUs available in a socket...

With 1.10.0:

$ /apps/openmpi/1.10.0/bin/mpirun -np 2 -mca rmaps_base_verbose 1000 hostname
[r47:18709] mca: base: components_register: registering rmaps components
[r47:18709] mca: base: components_register: found loaded component resilient
[r47:18709] mca: base: components_register: component resilient register function successful
[r47:18709] mca: base: components_register: found loaded component rank_file
[r47:18709] mca: base: components_register: component rank_file register function successful
[r47:18709] mca: base: components_register: found loaded component staged
[r47:18709] mca: base: components_register: component staged has no register or open function
[r47:18709] mca: base: components_register: found loaded component ppr
[r47:18709] mca: base: components_register: component ppr register function successful
[r47:18709] mca: base: components_register: found loaded component seq
[r47:18709] mca: base: components_register: component seq register function successful
[r47:18709] mca: base: components_register: found loaded component round_robin
[r47:18709] mca: base: components_register: component round_robin register function successful
[r47:18709] mca: base: components_register: found loaded component mindist
[r47:18709] mca: base: components_register: component mindist register function successful
[r47:18709] [[63529,0],0] rmaps:base set policy with core
[r47:18709] mca: base: components_open: opening rmaps components
[r47:18709] mca: base: components_open: found loaded component resilient
[r47:18709] mca: base: components_open: component resilient open function successful
[r47:18709] mca: base: components_open: found loaded component rank_file
[r47:18709] mca: base: components_open: component rank_file open function successful
[r47:18709] mca: base: components_open: found loaded component staged
[r47:18709] mca: base: components_open: component staged open function successful
[r47:18709] mca: base: components_open: found loaded component ppr
[r47:18709] mca: base: components_open: component ppr open function successful
[r47:18709] mca: base: components_open: found loaded component seq
[r47:18709] mca: base: components_open: component seq open function successful
[r47:18709] mca: base: components_open: found loaded component round_robin
[r47:18709] mca: base: components_open: component round_robin open function successful
[r47:18709] mca: base: components_open: found loaded component mindist
[r47:18709] mca: base: components_open: component mindist open function successful
[r47:18709] mca:rmaps:select: checking available component resilient
[r47:18709] mca:rmaps:select: Querying component [resilient]
[r47:18709] mca:rmaps:select: checking available component rank_file
[r47:18709] mca:rmaps:select: Querying component [rank_file]
[r47:18709] mca:rmaps:select: checking available component staged
[r47:18709] mca:rmaps:select: Querying component [staged]
[r47:18709] mca:rmaps:select: checking available component ppr
[r47:18709] mca:rmaps:select: Querying component [ppr]
[r47:18709] mca:rmaps:select: checking available component seq
[r47:18709] mca:rmaps:select: Querying component [seq]
[r47:18709] mca:rmaps:select: checking available component round_robin
[r47:18709] mca:rmaps:select: Querying component [round_robin]
[r47:18709] mca:rmaps:select: checking available component mindist
[r47:18709] mca:rmaps:select: Querying component [mindist]
[r47:18709] [[63529,0],0]: Final mapper priorities
[r47:18709]     Mapper: ppr Priority: 90
[r47:18709]     Mapper: seq Priority: 60
[r47:18709]     Mapper: resilient Priority: 40
[r47:18709]     Mapper: mindist Priority: 20
[r47:18709]     Mapper: round_robin Priority: 10
[r47:18709]     Mapper: staged Priority: 5
[r47:18709]     Mapper: rank_file Priority: 0
[r47:18709] mca:rmaps: mapping job [63529,1]
[r47:18709] mca:rmaps: creating new map for job [63529,1]
[r47:18709] mca:rmaps: nprocs 2
[r47:18709] mca:rmaps mapping given - using default
[r47:18709] mca:rmaps:ppr: job [63529,1] not using ppr mapper
[r47:18709] mca:rmaps:seq: job [63529,1] not using seq mapper
[r47:18709] mca:rmaps:resilient: cannot perform initial map of job [63529,1] - no fault groups
[r47:18709] mca:rmaps:mindist: job [63529,1] not using mindist mapper
[r47:18709] mca:rmaps:rr: mapping job [63529,1]
[r47:18709] AVAILABLE NODES FOR MAPPING:
[r47:18709]     node: r47 daemon: 0
[r47:18709]     node: r57 daemon: 1
[r47:18709]     node: r58 daemon: 2
[r47:18709]     node: r59 daemon: 3
[r47:18709] mca:rmaps:rr: mapping no-span by Core for job [63529,1] slots 64 num_procs 2
[r47:18709] mca:rmaps:rr: found 16 Core objects on node r47
[r47:18709] mca:rmaps:rr: assigning proc to object 0
[r47:18709] mca:rmaps:rr: assigning proc to object 1
[r47:18709] mca:rmaps: computing ranks by core for job [63529,1]
[r47:18709] mca:rmaps:rank_by: found 16 objects on node r47 with 2 procs
[r47:18709] mca:rmaps:rank_by: assigned rank 0
[r47:18709] mca:rmaps:rank_by: assigned rank 1
[r47:18709] mca:rmaps:rank_by: found 16 objects on node r57 with 0 procs
[r47:18709] mca:rmaps:rank_by: found 16 objects on node r58 with 0 procs
[r47:18709] mca:rmaps:rank_by: found 16 objects on node r59 with 0 procs
[r47:18709] mca:rmaps: compute bindings for job [63529,1] with policy CORE[4008]
[r47:18709] mca:rmaps: bindings for job [63529,1] - bind in place
[r47:18709] mca:rmaps: bind in place for job [63529,1] with bindings CORE
[r47:18709] [[63529,0],0] reset_usage: node r47 has 2 procs on it
[r47:18709] [[63529,0],0] reset_usage: ignoring proc [[63529,1],0]
[r47:18709] [[63529,0],0] reset_usage: ignoring proc [[63529,1],1]
[r47:18709] BINDING PROC [[63529,1],0] TO Core NUMBER 0
[r47:18709] [[63529,0],0] BOUND PROC [[63529,1],0] TO 0[Core:0] on node r47
[r47:18709] BINDING PROC [[63529,1],1] TO Core NUMBER 1
[r47:18709] [[63529,0],0] BOUND PROC [[63529,1],1] TO 1[Core:1] on node r47
r47
r47
[r47:18709] mca: base: close: component resilient closed
[r47:18709] mca: base: close: unloading component resilient
[r47:18709] mca: base: close: component rank_file closed
[r47:18709] mca: base: close: unloading component rank_file
[r47:18709] mca: base: close: component staged closed
[r47:18709] mca: base: close: unloading component staged
[r47:18709] mca: base: close: component ppr closed
[r47:18709] mca: base: close: unloading component ppr
[r47:18709] mca: base: close: component seq closed
[r47:18709] mca: base: close: unloading component seq
[r47:18709] mca: base: close: component round_robin closed
[r47:18709] mca: base: close: unloading component round_robin
[r47:18709] mca: base: close: component mindist closed
[r47:18709] mca: base: close: unloading component mindist

With 1.10.2:

$ /apps/openmpi/1.10.2/bin/mpirun -np 2 -mca rmaps_base_verbose 1000 hostname
[r47:18733] mca: base: components_register: registering rmaps components
[r47:18733] mca: base: components_register: found loaded component resilient
[r47:18733] mca: base: components_register: component resilient register function successful
[r47:18733] mca: base: components_register: found loaded component rank_file
[r47:18733] mca: base: components_register: component rank_file register function successful
[r47:18733] mca: base: components_register: found loaded component staged
[r47:18733] mca: base: components_register: component staged has no register or open function
[r47:18733] mca: base: components_register: found loaded component ppr
[r47:18733] mca: base: components_register: component ppr register function successful
[r47:18733] mca: base: components_register: found loaded component seq
[r47:18733] mca: base: components_register: component seq register function successful
[r47:18733] mca: base: components_register: found loaded component round_robin
[r47:18733] mca: base: components_register: component round_robin register function successful
[r47:18733] mca: base: components_register: found loaded component mindist
[r47:18733] mca: base: components_register: component mindist register function successful
[r47:18733] [[63505,0],0] rmaps:base set policy with core
[r47:18733] mca: base: components_open: opening rmaps components
[r47:18733] mca: base: components_open: found loaded component resilient
[r47:18733] mca: base: components_open: component resilient open function successful
[r47:18733] mca: base: components_open: found loaded component rank_file
[r47:18733] mca: base: components_open: component rank_file open function successful
[r47:18733] mca: base: components_open: found loaded component staged
[r47:18733] mca: base: components_open: component staged open function successful
[r47:18733] mca: base: components_open: found loaded component ppr
[r47:18733] mca: base: components_open: component ppr open function successful
[r47:18733] mca: base: components_open: found loaded component seq
[r47:18733] mca: base: components_open: component seq open function successful
[r47:18733] mca: base: components_open: found loaded component round_robin
[r47:18733] mca: base: components_open: component round_robin open function successful
[r47:18733] mca: base: components_open: found loaded component mindist
[r47:18733] mca: base: components_open: component mindist open function successful
[r47:18733] mca:rmaps:select: checking available component resilient
[r47:18733] mca:rmaps:select: Querying component [resilient]
[r47:18733] mca:rmaps:select: checking available component rank_file
[r47:18733] mca:rmaps:select: Querying component [rank_file]
[r47:18733] mca:rmaps:select: checking available component staged
[r47:18733] mca:rmaps:select: Querying component [staged]
[r47:18733] mca:rmaps:select: checking available component ppr
[r47:18733] mca:rmaps:select: Querying component [ppr]
[r47:18733] mca:rmaps:select: checking available component seq
[r47:18733] mca:rmaps:select: Querying component [seq]
[r47:18733] mca:rmaps:select: checking available component round_robin
[r47:18733] mca:rmaps:select: Querying component [round_robin]
[r47:18733] mca:rmaps:select: checking available component mindist
[r47:18733] mca:rmaps:select: Querying component [mindist]
[r47:18733] [[63505,0],0]: Final mapper priorities
[r47:18733]     Mapper: ppr Priority: 90
[r47:18733]     Mapper: seq Priority: 60
[r47:18733]     Mapper: resilient Priority: 40
[r47:18733]     Mapper: mindist Priority: 20
[r47:18733]     Mapper: round_robin Priority: 10
[r47:18733]     Mapper: staged Priority: 5
[r47:18733]     Mapper: rank_file Priority: 0
[r47:18733] mca:rmaps: mapping job [63505,1]
[r47:18733] mca:rmaps: creating new map for job [63505,1]
[r47:18733] mca:rmaps: nprocs 2
[r47:18733] mca:rmaps mapping given - using default
[r47:18733] mca:rmaps:ppr: job [63505,1] not using ppr mapper
[r47:18733] mca:rmaps:seq: job [63505,1] not using seq mapper
[r47:18733] mca:rmaps:resilient: cannot perform initial map of job [63505,1] - no fault groups
[r47:18733] mca:rmaps:mindist: job [63505,1] not using mindist mapper
[r47:18733] mca:rmaps:rr: mapping job [63505,1]
[r47:18733] AVAILABLE NODES FOR MAPPING:
[r47:18733]     node: r47 daemon: 0
[r47:18733]     node: r57 daemon: 1
[r47:18733]     node: r58 daemon: 2
[r47:18733]     node: r59 daemon: 3
[r47:18733] mca:rmaps:rr: mapping no-span by Core for job [63505,1] slots 64 num_procs 2
[r47:18733] mca:rmaps:rr: found 16 Core objects on node r47
[r47:18733] mca:rmaps:rr: assigning proc to object 0
--------------------------------------------------------------------------
A request for multiple cpus-per-proc was given, but a directive
was also give to map to an object level that has less cpus than
requested ones:

    #cpus-per-proc:  1
    number of cpus:  0
    map-by:          BYCORE:NOOVERSUBSCRIBE

Please specify a mapping level that has more cpus, or else let us
define a default mapping that will allow multiple cpus-per-proc.
--------------------------------------------------------------------------
[r47:18733] mca: base: close: component resilient closed
[r47:18733] mca: base: close: unloading component resilient
[r47:18733] mca: base: close: component rank_file closed
[r47:18733] mca: base: close: unloading component rank_file
[r47:18733] mca: base: close: component staged closed
[r47:18733] mca: base: close: unloading component staged
[r47:18733] mca: base: close: component ppr closed
[r47:18733] mca: base: close: unloading component ppr
[r47:18733] mca: base: close: component seq closed
[r47:18733] mca: base: close: unloading component seq
[r47:18733] mca: base: close: component round_robin closed
[r47:18733] mca: base: close: unloading component round_robin
[r47:18733] mca: base: close: component mindist closed
[r47:18733] mca: base: close: unloading component mindist

These are both from the same PBS Pro job, and the cpuset definitely has
all cores available:

$ cat /cgroup/cpuset/pbspro/4347646.r-man2/cpuset.cpus
0-15

Is there something here I'm missing?

Cheers,
Ben

