Hi Ben and Ralph, just a very short comment.

The error message suggests that hardware detection is not working properly,
because it reports the number of cpus as zero.

>
>   #cpus-per-proc:  1
>
>   number of cpus:  0
>
>   map-by:          BYSOCKET:NOOVERSUBSCRIBE
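
To narrow down whether the cpuset itself or the detection is at fault, one
option is a small check like the sketch below, run both inside and outside
the PBS job. It is only an illustration: parse_cpu_list and visible_cpus are
hypothetical helper names (not part of Open MPI), the parser handles plain
comma-separated ids and ranges only, and os.sched_getaffinity is available
on Linux only.

```python
import os


def parse_cpu_list(text):
    """Expand a cpuset list such as "0-3,8,10-11" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpuset string yields no CPUs
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def visible_cpus():
    """Return the CPUs the calling process may run on, or None if unsupported."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return None


if __name__ == "__main__":
    # "0-15" is the cpuset.cpus value reported for the PBS job in this thread.
    print("cpuset.cpus lists", len(parse_cpu_list("0-15")), "cpus")
    print("affinity mask of this process:", visible_cpus())
```

If the cpuset file and the affinity mask both list all the cores while mpirun
still reports zero cpus, that would point at the detection step rather than
at the allocation.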

Regards,
Tetsuya

> Thanks Ralph,
>
>
>
> There are no MCA parameters in my environment at all. Here are the contents
> of openmpi-mca-params.conf:
>
>
>
> mpi_leave_pinned = 0
>
> hwloc_base_binding_policy = core
>
> rmaps_base_mapping_policy = core
>
> hwloc_base_mem_alloc_policy = local_only
>
> shmem_mmap_enable_nfs_warning = 0
>
> pml = ^yalla
>
> mtl = ^mxm
>
> mtl_mxm_np = 0
>
> coll = ^fca
>
> coll_fca_enable = 1
>
> coll_fca_np = 0
>
>
>
> These are the same as for 1.10.0 (it’s a symlink to the same file).
> There’s nothing there that I can see that would cause it to think that I
> was asking for multiple CPUs per proc. Getting rid of all of the ‘policy’
> options doesn’t change the behaviour, except it then says
>
>
>
> [r51:18193] mca:rmaps:rr: mapping no-span by Socket for job [25745,1]
slots 32 num_procs 32
>
> [r51:18193] mca:rmaps:rr: found 2 Socket objects on node r51
>
> [r51:18193] mca:rmaps:rr: assigning proc to object 0
>
>
--------------------------------------------------------------------------
>
> A request for multiple cpus-per-proc was given, but a directive
>
> was also give to map to an object level that has less cpus than
>
> requested ones:
>
>
>
>   #cpus-per-proc:  1
>
>   number of cpus:  0
>
>   map-by:          BYSOCKET:NOOVERSUBSCRIBE
>
>
>
> Please specify a mapping level that has more cpus, or else let us
>
> define a default mapping that will allow multiple cpus-per-proc.
>
>
--------------------------------------------------------------------------
>
>
>
> Forcing it to use ppr instead of rr with ppr:1:core:PE=1 using the MCA
> parameters above gives this:
>
>
>
> [r51:18320] AVAILABLE NODES FOR MAPPING:
>
> [r51:18320]     node: r51 daemon: 0
>
> [r51:18320]     node: r58 daemon: 1
>
> [r51:18320] mca:rmaps:base: computing vpids by slot for job [25616,1]
>
> [r51:18320] mca:rmaps:base: assigning rank 0 to node r51
>
> [r51:18320] mca:rmaps:base: assigning rank 1 to node r51
>
> [r51:18320] mca:rmaps:base: assigning rank 2 to node r58
>
> [r51:18320] mca:rmaps:base: assigning rank 3 to node r58
>
> [r51:18320] mca:rmaps: compute bindings for job [25616,1] with policy
CORE[4008]
>
> [r51:18320] [[25616,0],0] reset_usage: node r51 has 2 procs on it
>
> [r51:18320] [[25616,0],0] reset_usage: ignoring proc [[25616,1],0]
>
> [r51:18320] [[25616,0],0] reset_usage: ignoring proc [[25616,1],1]
>
> [r51:18320] [[25616,0],0] bind_depth: 6 map_depth 2
>
> [r51:18320] mca:rmaps: bind downward for job [25616,1] with bindings CORE
>
>
--------------------------------------------------------------------------
>
> While computing bindings, we found no available cpus on
>
> the following node:
>
>
>
>   Node:  r51
>
>
>
> Please check your allocation.
>
>
--------------------------------------------------------------------------
>
>
>
> (Actually, it’s the same regardless of whether it’s socket, core, or node.)
> If I get rid of the policy options as above, I get the original error.
>
>
>
> However, if I do it outside of a PBS job (so no cgroup), it works as I
> would expect. So have there been any changes in the handling of cpusets?
>
>
>
> Cheers,
>
> Ben
>
>
>
>
>
> From:users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Friday, 29 January 2016 3:46 AM
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] Any changes to rmaps in 1.10.2?
>
>
>
> I'm unaware of any change that would impact you here. For some reason,
> mpirun believes you are requesting multiple cpus-per-proc, and that seems
> to be the heart of the problem. Is there an MCA parameter in your
> environment or default param file, perhaps?
>
>
>
>
>
> On Wed, Jan 27, 2016 at 2:57 PM, Ben Menadue <ben.mena...@nci.org.au>
wrote:
>
> Hi,
>
> Were there any changes to rmaps in going to 1.10.2? An otherwise-identical
> setup that worked in 1.10.0 fails to launch in 1.10.2, complaining that
> there are no CPUs available in a socket...
>
> With 1.10.0:
>
> $ /apps/openmpi/1.10.0/bin/mpirun -np 2 -mca rmaps_base_verbose 1000
> hostname
> [r47:18709] mca: base: components_register: registering rmaps components
> [r47:18709] mca: base: components_register: found loaded component
resilient
> [r47:18709] mca: base: components_register: component resilient register
> function successful
> [r47:18709] mca: base: components_register: found loaded component
rank_file
> [r47:18709] mca: base: components_register: component rank_file register
> function successful
> [r47:18709] mca: base: components_register: found loaded component staged
> [r47:18709] mca: base: components_register: component staged has no
register
> or open function
> [r47:18709] mca: base: components_register: found loaded component ppr
> [r47:18709] mca: base: components_register: component ppr register
function
> successful
> [r47:18709] mca: base: components_register: found loaded component seq
> [r47:18709] mca: base: components_register: component seq register
function
> successful
> [r47:18709] mca: base: components_register: found loaded component
> round_robin
> [r47:18709] mca: base: components_register: component round_robin
register
> function successful
> [r47:18709] mca: base: components_register: found loaded component
mindist
> [r47:18709] mca: base: components_register: component mindist register
> function successful
> [r47:18709] [[63529,0],0] rmaps:base set policy with core
> [r47:18709] mca: base: components_open: opening rmaps components
> [r47:18709] mca: base: components_open: found loaded component resilient
> [r47:18709] mca: base: components_open: component resilient open function
> successful
> [r47:18709] mca: base: components_open: found loaded component rank_file
> [r47:18709] mca: base: components_open: component rank_file open function
> successful
> [r47:18709] mca: base: components_open: found loaded component staged
> [r47:18709] mca: base: components_open: component staged open function
> successful
> [r47:18709] mca: base: components_open: found loaded component ppr
> [r47:18709] mca: base: components_open: component ppr open function
> successful
> [r47:18709] mca: base: components_open: found loaded component seq
> [r47:18709] mca: base: components_open: component seq open function
> successful
> [r47:18709] mca: base: components_open: found loaded component
round_robin
> [r47:18709] mca: base: components_open: component round_robin open
function
> successful
> [r47:18709] mca: base: components_open: found loaded component mindist
> [r47:18709] mca: base: components_open: component mindist open function
> successful
> [r47:18709] mca:rmaps:select: checking available component resilient
> [r47:18709] mca:rmaps:select: Querying component [resilient]
> [r47:18709] mca:rmaps:select: checking available component rank_file
> [r47:18709] mca:rmaps:select: Querying component [rank_file]
> [r47:18709] mca:rmaps:select: checking available component staged
> [r47:18709] mca:rmaps:select: Querying component [staged]
> [r47:18709] mca:rmaps:select: checking available component ppr
> [r47:18709] mca:rmaps:select: Querying component [ppr]
> [r47:18709] mca:rmaps:select: checking available component seq
> [r47:18709] mca:rmaps:select: Querying component [seq]
> [r47:18709] mca:rmaps:select: checking available component round_robin
> [r47:18709] mca:rmaps:select: Querying component [round_robin]
> [r47:18709] mca:rmaps:select: checking available component mindist
> [r47:18709] mca:rmaps:select: Querying component [mindist]
> [r47:18709] [[63529,0],0]: Final mapper priorities
> [r47:18709]     Mapper: ppr Priority: 90
> [r47:18709]     Mapper: seq Priority: 60
> [r47:18709]     Mapper: resilient Priority: 40
> [r47:18709]     Mapper: mindist Priority: 20
> [r47:18709]     Mapper: round_robin Priority: 10
> [r47:18709]     Mapper: staged Priority: 5
> [r47:18709]     Mapper: rank_file Priority: 0
> [r47:18709] mca:rmaps: mapping job [63529,1]
> [r47:18709] mca:rmaps: creating new map for job [63529,1]
> [r47:18709] mca:rmaps: nprocs 2
> [r47:18709] mca:rmaps mapping given - using default
> [r47:18709] mca:rmaps:ppr: job [63529,1] not using ppr mapper
> [r47:18709] mca:rmaps:seq: job [63529,1] not using seq mapper
> [r47:18709] mca:rmaps:resilient: cannot perform initial map of job
[63529,1]
> - no fault groups
> [r47:18709] mca:rmaps:mindist: job [63529,1] not using mindist mapper
> [r47:18709] mca:rmaps:rr: mapping job [63529,1]
> [r47:18709] AVAILABLE NODES FOR MAPPING:
> [r47:18709]     node: r47 daemon: 0
> [r47:18709]     node: r57 daemon: 1
> [r47:18709]     node: r58 daemon: 2
> [r47:18709]     node: r59 daemon: 3
> [r47:18709] mca:rmaps:rr: mapping no-span by Core for job [63529,1] slots
64
> num_procs 2
> [r47:18709] mca:rmaps:rr: found 16 Core objects on node r47
> [r47:18709] mca:rmaps:rr: assigning proc to object 0
> [r47:18709] mca:rmaps:rr: assigning proc to object 1
> [r47:18709] mca:rmaps: computing ranks by core for job [63529,1]
> [r47:18709] mca:rmaps:rank_by: found 16 objects on node r47 with 2 procs
> [r47:18709] mca:rmaps:rank_by: assigned rank 0
> [r47:18709] mca:rmaps:rank_by: assigned rank 1
> [r47:18709] mca:rmaps:rank_by: found 16 objects on node r57 with 0 procs
> [r47:18709] mca:rmaps:rank_by: found 16 objects on node r58 with 0 procs
> [r47:18709] mca:rmaps:rank_by: found 16 objects on node r59 with 0 procs
> [r47:18709] mca:rmaps: compute bindings for job [63529,1] with policy
> CORE[4008]
> [r47:18709] mca:rmaps: bindings for job [63529,1] - bind in place
> [r47:18709] mca:rmaps: bind in place for job [63529,1] with bindings CORE
> [r47:18709] [[63529,0],0] reset_usage: node r47 has 2 procs on it
> [r47:18709] [[63529,0],0] reset_usage: ignoring proc [[63529,1],0]
> [r47:18709] [[63529,0],0] reset_usage: ignoring proc [[63529,1],1]
> [r47:18709] BINDING PROC [[63529,1],0] TO Core NUMBER 0
> [r47:18709] [[63529,0],0] BOUND PROC [[63529,1],0] TO 0[Core:0] on node
r47
> [r47:18709] BINDING PROC [[63529,1],1] TO Core NUMBER 1
> [r47:18709] [[63529,0],0] BOUND PROC [[63529,1],1] TO 1[Core:1] on node
r47
> r47
> r47
> [r47:18709] mca: base: close: component resilient closed
> [r47:18709] mca: base: close: unloading component resilient
> [r47:18709] mca: base: close: component rank_file closed
> [r47:18709] mca: base: close: unloading component rank_file
> [r47:18709] mca: base: close: component staged closed
> [r47:18709] mca: base: close: unloading component staged
> [r47:18709] mca: base: close: component ppr closed
> [r47:18709] mca: base: close: unloading component ppr
> [r47:18709] mca: base: close: component seq closed
> [r47:18709] mca: base: close: unloading component seq
> [r47:18709] mca: base: close: component round_robin closed
> [r47:18709] mca: base: close: unloading component round_robin
> [r47:18709] mca: base: close: component mindist closed
> [r47:18709] mca: base: close: unloading component mindist
>
> With 1.10.2:
>
> $ /apps/openmpi/1.10.2/bin/mpirun -np 2 -mca rmaps_base_verbose 1000
> hostname
> [r47:18733] mca: base: components_register: registering rmaps components
> [r47:18733] mca: base: components_register: found loaded component
resilient
> [r47:18733] mca: base: components_register: component resilient register
> function successful
> [r47:18733] mca: base: components_register: found loaded component
rank_file
> [r47:18733] mca: base: components_register: component rank_file register
> function successful
> [r47:18733] mca: base: components_register: found loaded component staged
> [r47:18733] mca: base: components_register: component staged has no
register
> or open function
> [r47:18733] mca: base: components_register: found loaded component ppr
> [r47:18733] mca: base: components_register: component ppr register
function
> successful
> [r47:18733] mca: base: components_register: found loaded component seq
> [r47:18733] mca: base: components_register: component seq register
function
> successful
> [r47:18733] mca: base: components_register: found loaded component
> round_robin
> [r47:18733] mca: base: components_register: component round_robin
register
> function successful
> [r47:18733] mca: base: components_register: found loaded component
mindist
> [r47:18733] mca: base: components_register: component mindist register
> function successful
> [r47:18733] [[63505,0],0] rmaps:base set policy with core
> [r47:18733] mca: base: components_open: opening rmaps components
> [r47:18733] mca: base: components_open: found loaded component resilient
> [r47:18733] mca: base: components_open: component resilient open function
> successful
> [r47:18733] mca: base: components_open: found loaded component rank_file
> [r47:18733] mca: base: components_open: component rank_file open function
> successful
> [r47:18733] mca: base: components_open: found loaded component staged
> [r47:18733] mca: base: components_open: component staged open function
> successful
> [r47:18733] mca: base: components_open: found loaded component ppr
> [r47:18733] mca: base: components_open: component ppr open function
> successful
> [r47:18733] mca: base: components_open: found loaded component seq
> [r47:18733] mca: base: components_open: component seq open function
> successful
> [r47:18733] mca: base: components_open: found loaded component
round_robin
> [r47:18733] mca: base: components_open: component round_robin open
function
> successful
> [r47:18733] mca: base: components_open: found loaded component mindist
> [r47:18733] mca: base: components_open: component mindist open function
> successful
> [r47:18733] mca:rmaps:select: checking available component resilient
> [r47:18733] mca:rmaps:select: Querying component [resilient]
> [r47:18733] mca:rmaps:select: checking available component rank_file
> [r47:18733] mca:rmaps:select: Querying component [rank_file]
> [r47:18733] mca:rmaps:select: checking available component staged
> [r47:18733] mca:rmaps:select: Querying component [staged]
> [r47:18733] mca:rmaps:select: checking available component ppr
> [r47:18733] mca:rmaps:select: Querying component [ppr]
> [r47:18733] mca:rmaps:select: checking available component seq
> [r47:18733] mca:rmaps:select: Querying component [seq]
> [r47:18733] mca:rmaps:select: checking available component round_robin
> [r47:18733] mca:rmaps:select: Querying component [round_robin]
> [r47:18733] mca:rmaps:select: checking available component mindist
> [r47:18733] mca:rmaps:select: Querying component [mindist]
> [r47:18733] [[63505,0],0]: Final mapper priorities
> [r47:18733]     Mapper: ppr Priority: 90
> [r47:18733]     Mapper: seq Priority: 60
> [r47:18733]     Mapper: resilient Priority: 40
> [r47:18733]     Mapper: mindist Priority: 20
> [r47:18733]     Mapper: round_robin Priority: 10
> [r47:18733]     Mapper: staged Priority: 5
> [r47:18733]     Mapper: rank_file Priority: 0
> [r47:18733] mca:rmaps: mapping job [63505,1]
> [r47:18733] mca:rmaps: creating new map for job [63505,1]
> [r47:18733] mca:rmaps: nprocs 2
> [r47:18733] mca:rmaps mapping given - using default
> [r47:18733] mca:rmaps:ppr: job [63505,1] not using ppr mapper
> [r47:18733] mca:rmaps:seq: job [63505,1] not using seq mapper
> [r47:18733] mca:rmaps:resilient: cannot perform initial map of job
[63505,1]
> - no fault groups
> [r47:18733] mca:rmaps:mindist: job [63505,1] not using mindist mapper
> [r47:18733] mca:rmaps:rr: mapping job [63505,1]
> [r47:18733] AVAILABLE NODES FOR MAPPING:
> [r47:18733]     node: r47 daemon: 0
> [r47:18733]     node: r57 daemon: 1
> [r47:18733]     node: r58 daemon: 2
> [r47:18733]     node: r59 daemon: 3
> [r47:18733] mca:rmaps:rr: mapping no-span by Core for job [63505,1] slots
64
> num_procs 2
> [r47:18733] mca:rmaps:rr: found 16 Core objects on node r47
> [r47:18733] mca:rmaps:rr: assigning proc to object 0
>
--------------------------------------------------------------------------
> A request for multiple cpus-per-proc was given, but a directive
> was also give to map to an object level that has less cpus than
> requested ones:
>
>   #cpus-per-proc:  1
>   number of cpus:  0
>   map-by:          BYCORE:NOOVERSUBSCRIBE
>
> Please specify a mapping level that has more cpus, or else let us
> define a default mapping that will allow multiple cpus-per-proc.
>
--------------------------------------------------------------------------
> [r47:18733] mca: base: close: component resilient closed
> [r47:18733] mca: base: close: unloading component resilient
> [r47:18733] mca: base: close: component rank_file closed
> [r47:18733] mca: base: close: unloading component rank_file
> [r47:18733] mca: base: close: component staged closed
> [r47:18733] mca: base: close: unloading component staged
> [r47:18733] mca: base: close: component ppr closed
> [r47:18733] mca: base: close: unloading component ppr
> [r47:18733] mca: base: close: component seq closed
> [r47:18733] mca: base: close: unloading component seq
> [r47:18733] mca: base: close: component round_robin closed
> [r47:18733] mca: base: close: unloading component round_robin
> [r47:18733] mca: base: close: component mindist closed
> [r47:18733] mca: base: close: unloading component mindist
>
> These are both in the same PBS Pro job. And the cpuset definitely has all
> cores available:
>
> $ cat /cgroup/cpuset/pbspro/4347646.r-man2/cpuset.cpus
> 0-15
>
> Is there something here I'm missing?
>
> Cheers,
> Ben
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
http://www.open-mpi.org/community/lists/users/2016/01/28393.php
>
>  _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/01/28397.php
