Actually, looking at the output, it appears that we are correctly detecting
the cpus. It looks instead like there is some other setting that is
overriding the discovery.

Is your allocation setting a specific cpuset? Or are you allocating the
entire node?
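
One quick way to answer that from inside the job is to compare the CPUs the
kernel will actually let a process run on against what the cpuset file
reports. The following is only a minimal sketch, assuming a Linux node where
/proc/self/status exposes a Cpus_allowed_list field; the helper names
parse_cpu_list and allowed_cpus are made up for illustration and are not part
of Open MPI or PBS.

```python
def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0-3,8,10-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            # An empty list (or a trailing comma) contributes no CPUs.
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def allowed_cpus(status_path="/proc/self/status"):
    """Return the CPUs this process may run on, per Cpus_allowed_list.

    Assumption: a Linux /proc filesystem. Returns an empty set if the
    field is not present in the file.
    """
    with open(status_path) as status:
        for line in status:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    return set()
```

Running sorted(allowed_cpus()) inside the PBS job and comparing it with the
parsed contents of the job's cpuset.cpus file would show whether the job is
confined to a subset of the node or has every core.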


On Thu, Jan 28, 2016 at 3:19 PM, <tmish...@jcity.maeda.co.jp> wrote:

> Hi Ben and Ralph, just a very short comment.
>
> The error message shows the hardware detection doesn't work well,
> because it says the number of cpus is zero.
>
> >
> >   #cpus-per-proc:  1
> >
> >   number of cpus:  0
> >
> >   map-by:          BYSOCKET:NOOVERSUBSCRIBE
>
> Regards,
> Tetsuya
>
> > Thanks Ralph,
> >
> >
> >
> > There are no MCA parameters in my environment at all. Here are the contents
> of openmpi-mca-params.conf:
> >
> >
> >
> > mpi_leave_pinned = 0
> >
> > hwloc_base_binding_policy = core
> >
> > rmaps_base_mapping_policy = core
> >
> > hwloc_base_mem_alloc_policy = local_only
> >
> > shmem_mmap_enable_nfs_warning = 0
> >
> > pml = ^yalla
> >
> > mtl = ^mxm
> >
> > mtl_mxm_np = 0
> >
> > coll = ^fca
> >
> > coll_fca_enable = 1
> >
> > coll_fca_np = 0
> >
> >
> >
> > These are the same as for 1.10.0 (it’s a symlink to the same file).
> There’s nothing there that I can see that would cause it to think that I
> was asking for multiple CPUs per proc. Getting rid of all
> > of the ‘policy’ options doesn’t change the behaviour, except it then says
> >
> >
> >
> > [r51:18193] mca:rmaps:rr: mapping no-span by Socket for job [25745,1]
> slots 32 num_procs 32
> >
> > [r51:18193] mca:rmaps:rr: found 2 Socket objects on node r51
> >
> > [r51:18193] mca:rmaps:rr: assigning proc to object 0
> >
> >
> --------------------------------------------------------------------------
> >
> > A request for multiple cpus-per-proc was given, but a directive
> >
> > was also give to map to an object level that has less cpus than
> >
> > requested ones:
> >
> >
> >
> >   #cpus-per-proc:  1
> >
> >   number of cpus:  0
> >
> >   map-by:          BYSOCKET:NOOVERSUBSCRIBE
> >
> >
> >
> > Please specify a mapping level that has more cpus, or else let us
> >
> > define a default mapping that will allow multiple cpus-per-proc.
> >
> >
> --------------------------------------------------------------------------
> >
> >
> >
> > Forcing it to use ppr instead of rr with ppr:1:core:PE=1 using the MCA
> parameters above gives this:
> >
> >
> >
> > [r51:18320] AVAILABLE NODES FOR MAPPING:
> >
> > [r51:18320]     node: r51 daemon: 0
> >
> > [r51:18320]     node: r58 daemon: 1
> >
> > [r51:18320] mca:rmaps:base: computing vpids by slot for job [25616,1]
> >
> > [r51:18320] mca:rmaps:base: assigning rank 0 to node r51
> >
> > [r51:18320] mca:rmaps:base: assigning rank 1 to node r51
> >
> > [r51:18320] mca:rmaps:base: assigning rank 2 to node r58
> >
> > [r51:18320] mca:rmaps:base: assigning rank 3 to node r58
> >
> > [r51:18320] mca:rmaps: compute bindings for job [25616,1] with policy
> CORE[4008]
> >
> > [r51:18320] [[25616,0],0] reset_usage: node r51 has 2 procs on it
> >
> > [r51:18320] [[25616,0],0] reset_usage: ignoring proc [[25616,1],0]
> >
> > [r51:18320] [[25616,0],0] reset_usage: ignoring proc [[25616,1],1]
> >
> > [r51:18320] [[25616,0],0] bind_depth: 6 map_depth 2
> >
> > [r51:18320] mca:rmaps: bind downward for job [25616,1] with bindings CORE
> >
> >
> --------------------------------------------------------------------------
> >
> > While computing bindings, we found no available cpus on
> >
> > the following node:
> >
> >
> >
> >   Node:  r51
> >
> >
> >
> > Please check your allocation.
> >
> >
> --------------------------------------------------------------------------
> >
> >
> >
> > (actually, it’s the same regardless of whether it’s socket, core, or node). If I
> get rid of the policy options as above, I get the original error.
> >
> >
> >
> > However, if I do it outside of a PBS job (so no cgroup), it works as I
> would expect. So have there been any changes in the handling of cpusets?
> >
> >
> >
> > Cheers,
> >
> > Ben
> >
> >
> >
> >
> >
> > From:users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph
> Castain
> > Sent: Friday, 29 January 2016 3:46 AM
> > To: Open MPI Users <us...@open-mpi.org>
> > Subject: Re: [OMPI users] Any changes to rmaps in 1.10.2?
> >
> >
> >
> > I'm unaware of any change that would impact you here. For some reason,
> mpirun believes you are requesting multiple cpus-per-proc, and that seems
> to be the heart of the problem. Is there an MCA
> > parameter in your environment or default param file, perhaps?
> >
> >
> >
> >
> >
> > On Wed, Jan 27, 2016 at 2:57 PM, Ben Menadue <ben.mena...@nci.org.au>
> wrote:
> >
> > Hi,
> >
> > Were there any changes to rmaps in going to 1.10.2? An
> otherwise-identical
> > setup that worked in 1.10.0 fails to launch in 1.10.2, complaining that
> > there are no CPUs available in a socket...
> >
> > With 1.10.0:
> >
> > $ /apps/openmpi/1.10.0/bin/mpirun -np 2 -mca rmaps_base_verbose 1000
> > hostname
> > [r47:18709] mca: base: components_register: registering rmaps components
> > [r47:18709] mca: base: components_register: found loaded component
> resilient
> > [r47:18709] mca: base: components_register: component resilient register
> > function successful
> > [r47:18709] mca: base: components_register: found loaded component
> rank_file
> > [r47:18709] mca: base: components_register: component rank_file register
> > function successful
> > [r47:18709] mca: base: components_register: found loaded component staged
> > [r47:18709] mca: base: components_register: component staged has no
> register
> > or open function
> > [r47:18709] mca: base: components_register: found loaded component ppr
> > [r47:18709] mca: base: components_register: component ppr register
> function
> > successful
> > [r47:18709] mca: base: components_register: found loaded component seq
> > [r47:18709] mca: base: components_register: component seq register
> function
> > successful
> > [r47:18709] mca: base: components_register: found loaded component
> > round_robin
> > [r47:18709] mca: base: components_register: component round_robin
> register
> > function successful
> > [r47:18709] mca: base: components_register: found loaded component
> mindist
> > [r47:18709] mca: base: components_register: component mindist register
> > function successful
> > [r47:18709] [[63529,0],0] rmaps:base set policy with core
> > [r47:18709] mca: base: components_open: opening rmaps components
> > [r47:18709] mca: base: components_open: found loaded component resilient
> > [r47:18709] mca: base: components_open: component resilient open function
> > successful
> > [r47:18709] mca: base: components_open: found loaded component rank_file
> > [r47:18709] mca: base: components_open: component rank_file open function
> > successful
> > [r47:18709] mca: base: components_open: found loaded component staged
> > [r47:18709] mca: base: components_open: component staged open function
> > successful
> > [r47:18709] mca: base: components_open: found loaded component ppr
> > [r47:18709] mca: base: components_open: component ppr open function
> > successful
> > [r47:18709] mca: base: components_open: found loaded component seq
> > [r47:18709] mca: base: components_open: component seq open function
> > successful
> > [r47:18709] mca: base: components_open: found loaded component
> round_robin
> > [r47:18709] mca: base: components_open: component round_robin open
> function
> > successful
> > [r47:18709] mca: base: components_open: found loaded component mindist
> > [r47:18709] mca: base: components_open: component mindist open function
> > successful
> > [r47:18709] mca:rmaps:select: checking available component resilient
> > [r47:18709] mca:rmaps:select: Querying component [resilient]
> > [r47:18709] mca:rmaps:select: checking available component rank_file
> > [r47:18709] mca:rmaps:select: Querying component [rank_file]
> > [r47:18709] mca:rmaps:select: checking available component staged
> > [r47:18709] mca:rmaps:select: Querying component [staged]
> > [r47:18709] mca:rmaps:select: checking available component ppr
> > [r47:18709] mca:rmaps:select: Querying component [ppr]
> > [r47:18709] mca:rmaps:select: checking available component seq
> > [r47:18709] mca:rmaps:select: Querying component [seq]
> > [r47:18709] mca:rmaps:select: checking available component round_robin
> > [r47:18709] mca:rmaps:select: Querying component [round_robin]
> > [r47:18709] mca:rmaps:select: checking available component mindist
> > [r47:18709] mca:rmaps:select: Querying component [mindist]
> > [r47:18709] [[63529,0],0]: Final mapper priorities
> > [r47:18709]     Mapper: ppr Priority: 90
> > [r47:18709]     Mapper: seq Priority: 60
> > [r47:18709]     Mapper: resilient Priority: 40
> > [r47:18709]     Mapper: mindist Priority: 20
> > [r47:18709]     Mapper: round_robin Priority: 10
> > [r47:18709]     Mapper: staged Priority: 5
> > [r47:18709]     Mapper: rank_file Priority: 0
> > [r47:18709] mca:rmaps: mapping job [63529,1]
> > [r47:18709] mca:rmaps: creating new map for job [63529,1]
> > [r47:18709] mca:rmaps: nprocs 2
> > [r47:18709] mca:rmaps mapping given - using default
> > [r47:18709] mca:rmaps:ppr: job [63529,1] not using ppr mapper
> > [r47:18709] mca:rmaps:seq: job [63529,1] not using seq mapper
> > [r47:18709] mca:rmaps:resilient: cannot perform initial map of job
> [63529,1]
> > - no fault groups
> > [r47:18709] mca:rmaps:mindist: job [63529,1] not using mindist mapper
> > [r47:18709] mca:rmaps:rr: mapping job [63529,1]
> > [r47:18709] AVAILABLE NODES FOR MAPPING:
> > [r47:18709]     node: r47 daemon: 0
> > [r47:18709]     node: r57 daemon: 1
> > [r47:18709]     node: r58 daemon: 2
> > [r47:18709]     node: r59 daemon: 3
> > [r47:18709] mca:rmaps:rr: mapping no-span by Core for job [63529,1] slots
> 64
> > num_procs 2
> > [r47:18709] mca:rmaps:rr: found 16 Core objects on node r47
> > [r47:18709] mca:rmaps:rr: assigning proc to object 0
> > [r47:18709] mca:rmaps:rr: assigning proc to object 1
> > [r47:18709] mca:rmaps: computing ranks by core for job [63529,1]
> > [r47:18709] mca:rmaps:rank_by: found 16 objects on node r47 with 2 procs
> > [r47:18709] mca:rmaps:rank_by: assigned rank 0
> > [r47:18709] mca:rmaps:rank_by: assigned rank 1
> > [r47:18709] mca:rmaps:rank_by: found 16 objects on node r57 with 0 procs
> > [r47:18709] mca:rmaps:rank_by: found 16 objects on node r58 with 0 procs
> > [r47:18709] mca:rmaps:rank_by: found 16 objects on node r59 with 0 procs
> > [r47:18709] mca:rmaps: compute bindings for job [63529,1] with policy
> > CORE[4008]
> > [r47:18709] mca:rmaps: bindings for job [63529,1] - bind in place
> > [r47:18709] mca:rmaps: bind in place for job [63529,1] with bindings CORE
> > [r47:18709] [[63529,0],0] reset_usage: node r47 has 2 procs on it
> > [r47:18709] [[63529,0],0] reset_usage: ignoring proc [[63529,1],0]
> > [r47:18709] [[63529,0],0] reset_usage: ignoring proc [[63529,1],1]
> > [r47:18709] BINDING PROC [[63529,1],0] TO Core NUMBER 0
> > [r47:18709] [[63529,0],0] BOUND PROC [[63529,1],0] TO 0[Core:0] on node
> r47
> > [r47:18709] BINDING PROC [[63529,1],1] TO Core NUMBER 1
> > [r47:18709] [[63529,0],0] BOUND PROC [[63529,1],1] TO 1[Core:1] on node
> r47
> > r47
> > r47
> > [r47:18709] mca: base: close: component resilient closed
> > [r47:18709] mca: base: close: unloading component resilient
> > [r47:18709] mca: base: close: component rank_file closed
> > [r47:18709] mca: base: close: unloading component rank_file
> > [r47:18709] mca: base: close: component staged closed
> > [r47:18709] mca: base: close: unloading component staged
> > [r47:18709] mca: base: close: component ppr closed
> > [r47:18709] mca: base: close: unloading component ppr
> > [r47:18709] mca: base: close: component seq closed
> > [r47:18709] mca: base: close: unloading component seq
> > [r47:18709] mca: base: close: component round_robin closed
> > [r47:18709] mca: base: close: unloading component round_robin
> > [r47:18709] mca: base: close: component mindist closed
> > [r47:18709] mca: base: close: unloading component mindist
> >
> > With 1.10.2:
> >
> > $ /apps/openmpi/1.10.2/bin/mpirun -np 2 -mca rmaps_base_verbose 1000
> > hostname
> > [r47:18733] mca: base: components_register: registering rmaps components
> > [r47:18733] mca: base: components_register: found loaded component
> resilient
> > [r47:18733] mca: base: components_register: component resilient register
> > function successful
> > [r47:18733] mca: base: components_register: found loaded component
> rank_file
> > [r47:18733] mca: base: components_register: component rank_file register
> > function successful
> > [r47:18733] mca: base: components_register: found loaded component staged
> > [r47:18733] mca: base: components_register: component staged has no
> register
> > or open function
> > [r47:18733] mca: base: components_register: found loaded component ppr
> > [r47:18733] mca: base: components_register: component ppr register
> function
> > successful
> > [r47:18733] mca: base: components_register: found loaded component seq
> > [r47:18733] mca: base: components_register: component seq register
> function
> > successful
> > [r47:18733] mca: base: components_register: found loaded component
> > round_robin
> > [r47:18733] mca: base: components_register: component round_robin
> register
> > function successful
> > [r47:18733] mca: base: components_register: found loaded component
> mindist
> > [r47:18733] mca: base: components_register: component mindist register
> > function successful
> > [r47:18733] [[63505,0],0] rmaps:base set policy with core
> > [r47:18733] mca: base: components_open: opening rmaps components
> > [r47:18733] mca: base: components_open: found loaded component resilient
> > [r47:18733] mca: base: components_open: component resilient open function
> > successful
> > [r47:18733] mca: base: components_open: found loaded component rank_file
> > [r47:18733] mca: base: components_open: component rank_file open function
> > successful
> > [r47:18733] mca: base: components_open: found loaded component staged
> > [r47:18733] mca: base: components_open: component staged open function
> > successful
> > [r47:18733] mca: base: components_open: found loaded component ppr
> > [r47:18733] mca: base: components_open: component ppr open function
> > successful
> > [r47:18733] mca: base: components_open: found loaded component seq
> > [r47:18733] mca: base: components_open: component seq open function
> > successful
> > [r47:18733] mca: base: components_open: found loaded component
> round_robin
> > [r47:18733] mca: base: components_open: component round_robin open
> function
> > successful
> > [r47:18733] mca: base: components_open: found loaded component mindist
> > [r47:18733] mca: base: components_open: component mindist open function
> > successful
> > [r47:18733] mca:rmaps:select: checking available component resilient
> > [r47:18733] mca:rmaps:select: Querying component [resilient]
> > [r47:18733] mca:rmaps:select: checking available component rank_file
> > [r47:18733] mca:rmaps:select: Querying component [rank_file]
> > [r47:18733] mca:rmaps:select: checking available component staged
> > [r47:18733] mca:rmaps:select: Querying component [staged]
> > [r47:18733] mca:rmaps:select: checking available component ppr
> > [r47:18733] mca:rmaps:select: Querying component [ppr]
> > [r47:18733] mca:rmaps:select: checking available component seq
> > [r47:18733] mca:rmaps:select: Querying component [seq]
> > [r47:18733] mca:rmaps:select: checking available component round_robin
> > [r47:18733] mca:rmaps:select: Querying component [round_robin]
> > [r47:18733] mca:rmaps:select: checking available component mindist
> > [r47:18733] mca:rmaps:select: Querying component [mindist]
> > [r47:18733] [[63505,0],0]: Final mapper priorities
> > [r47:18733]     Mapper: ppr Priority: 90
> > [r47:18733]     Mapper: seq Priority: 60
> > [r47:18733]     Mapper: resilient Priority: 40
> > [r47:18733]     Mapper: mindist Priority: 20
> > [r47:18733]     Mapper: round_robin Priority: 10
> > [r47:18733]     Mapper: staged Priority: 5
> > [r47:18733]     Mapper: rank_file Priority: 0
> > [r47:18733] mca:rmaps: mapping job [63505,1]
> > [r47:18733] mca:rmaps: creating new map for job [63505,1]
> > [r47:18733] mca:rmaps: nprocs 2
> > [r47:18733] mca:rmaps mapping given - using default
> > [r47:18733] mca:rmaps:ppr: job [63505,1] not using ppr mapper
> > [r47:18733] mca:rmaps:seq: job [63505,1] not using seq mapper
> > [r47:18733] mca:rmaps:resilient: cannot perform initial map of job
> [63505,1]
> > - no fault groups
> > [r47:18733] mca:rmaps:mindist: job [63505,1] not using mindist mapper
> > [r47:18733] mca:rmaps:rr: mapping job [63505,1]
> > [r47:18733] AVAILABLE NODES FOR MAPPING:
> > [r47:18733]     node: r47 daemon: 0
> > [r47:18733]     node: r57 daemon: 1
> > [r47:18733]     node: r58 daemon: 2
> > [r47:18733]     node: r59 daemon: 3
> > [r47:18733] mca:rmaps:rr: mapping no-span by Core for job [63505,1] slots
> 64
> > num_procs 2
> > [r47:18733] mca:rmaps:rr: found 16 Core objects on node r47
> > [r47:18733] mca:rmaps:rr: assigning proc to object 0
> >
> --------------------------------------------------------------------------
> > A request for multiple cpus-per-proc was given, but a directive
> > was also give to map to an object level that has less cpus than
> > requested ones:
> >
> >   #cpus-per-proc:  1
> >   number of cpus:  0
> >   map-by:          BYCORE:NOOVERSUBSCRIBE
> >
> > Please specify a mapping level that has more cpus, or else let us
> > define a default mapping that will allow multiple cpus-per-proc.
> >
> --------------------------------------------------------------------------
> > [r47:18733] mca: base: close: component resilient closed
> > [r47:18733] mca: base: close: unloading component resilient
> > [r47:18733] mca: base: close: component rank_file closed
> > [r47:18733] mca: base: close: unloading component rank_file
> > [r47:18733] mca: base: close: component staged closed
> > [r47:18733] mca: base: close: unloading component staged
> > [r47:18733] mca: base: close: component ppr closed
> > [r47:18733] mca: base: close: unloading component ppr
> > [r47:18733] mca: base: close: component seq closed
> > [r47:18733] mca: base: close: unloading component seq
> > [r47:18733] mca: base: close: component round_robin closed
> > [r47:18733] mca: base: close: unloading component round_robin
> > [r47:18733] mca: base: close: component mindist closed
> > [r47:18733] mca: base: close: unloading component mindist
> >
> > These are both in the same PBS Pro job. And the cpuset definitely has all
> > cores available:
> >
> > $ cat /cgroup/cpuset/pbspro/4347646.r-man2/cpuset.cpus
> > 0-15
> >
> > Is there something here I'm missing?
> >
> > Cheers,
> > Ben
> >
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/01/28393.php
> >
