Hi Ben and Ralph, just a very short comment. The error message shows the hardware detection doesn't work well, because it says the number of cpus is zero.
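As a rough illustration of that point (not part of the original exchange): on Linux, the CPUs a process is allowed to run on are listed in /proc/self/status, and counting them gives something to compare against the "number of cpus: 0" that mpirun reports. The sketch below assumes a Linux system and a cpuset-style list such as 0-15; the helper name count_cpus is made up for this example and is not an Open MPI or hwloc tool.

```shell
#!/bin/sh
# Hypothetical helper (not an Open MPI or hwloc tool): count the CPUs named
# in a Linux cpuset-style list such as "0-15" or "0-3,8,10-11".
count_cpus() (
    # The body runs in a subshell, so the IFS change and the variables
    # used here do not leak into the calling shell.
    total=0
    IFS=','
    for part in $1; do
        case "$part" in
            *-*) lo=${part%-*}; hi=${part#*-}
                 total=$((total + hi - lo + 1)) ;;
            ?*)  total=$((total + 1)) ;;
        esac
    done
    echo "$total"
)

# Linux-specific: the CPU list the kernel allows for this shell.
if [ -r /proc/self/status ]; then
    allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
    echo "Cpus_allowed_list: $allowed ($(count_cpus "$allowed") cpus)"
fi
```

Run inside the PBS job, the count would be expected to match the job's cpuset.cpus file. If it does, the kernel-level restriction looks fine and the zero more likely comes from the topology detection step; hwloc's lstopo can then show the topology as hwloc itself sees it.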
> > #cpus-per-proc: 1
> > number of cpus: 0
> > map-by: BYSOCKET:NOOVERSUBSCRIBE

Regards,
Tetsuya

> Thanks Ralph,
> >
> > There are no MCA parameters in my environment at all. Here’s the contents of openmpi-mca-params.conf:
> >
> > mpi_leave_pinned = 0
> > hwloc_base_binding_policy = core
> > rmaps_base_mapping_policy = core
> > hwloc_base_mem_alloc_policy = local_only
> > shmem_mmap_enable_nfs_warning = 0
> > pml = ^yalla
> > mtl = ^mxm
> > mtl_mxm_np = 0
> > coll = ^fca
> > coll_fca_enable = 1
> > coll_fca_np = 0
> >
> > These are the same as for 1.10.0 (it’s a symlink to the same file). There’s nothing there that I can see that would cause it to think that I was asking for multiple CPUs per proc. Getting rid of all of the ‘policy’ options doesn’t change the behaviour, except it then says
> >
> > [r51:18193] mca:rmaps:rr: mapping no-span by Socket for job [25745,1] slots 32 num_procs 32
> > [r51:18193] mca:rmaps:rr: found 2 Socket objects on node r51
> > [r51:18193] mca:rmaps:rr: assigning proc to object 0
> > --------------------------------------------------------------------------
> > A request for multiple cpus-per-proc was given, but a directive
> > was also give to map to an object level that has less cpus than
> > requested ones:
> >
> > #cpus-per-proc: 1
> > number of cpus: 0
> > map-by: BYSOCKET:NOOVERSUBSCRIBE
> >
> > Please specify a mapping level that has more cpus, or else let us
> > define a default mapping that will allow multiple cpus-per-proc.
> > --------------------------------------------------------------------------
> >
> > Forcing it to use ppr instead of rr with ppr:1:core:PE=1 using the MCA parameters above gives this:
> >
> > [r51:18320] AVAILABLE NODES FOR MAPPING:
> > [r51:18320] node: r51 daemon: 0
> > [r51:18320] node: r58 daemon: 1
> > [r51:18320] mca:rmaps:base: computing vpids by slot for job [25616,1]
> > [r51:18320] mca:rmaps:base: assigning rank 0 to node r51
> > [r51:18320] mca:rmaps:base: assigning rank 1 to node r51
> > [r51:18320] mca:rmaps:base: assigning rank 2 to node r58
> > [r51:18320] mca:rmaps:base: assigning rank 3 to node r58
> > [r51:18320] mca:rmaps: compute bindings for job [25616,1] with policy CORE[4008]
> > [r51:18320] [[25616,0],0] reset_usage: node r51 has 2 procs on it
> > [r51:18320] [[25616,0],0] reset_usage: ignoring proc [[25616,1],0]
> > [r51:18320] [[25616,0],0] reset_usage: ignoring proc [[25616,1],1]
> > [r51:18320] [[25616,0],0] bind_depth: 6 map_depth 2
> > [r51:18320] mca:rmaps: bind downward for job [25616,1] with bindings CORE
> > --------------------------------------------------------------------------
> > While computing bindings, we found no available cpus on
> > the following node:
> >
> > Node: r51
> >
> > Please check your allocation.
> > --------------------------------------------------------------------------
> >
> > (actually, it’s the same regardless of whether it’s socket, core, or node). If I get rid of the policy options as above, I get the original error.
> >
> > However, if I do it outside of a PBS job (so no cgroup), it works as I would expect. So have there been any changes in the handling of cpusets?
> >
> > Cheers,
> > Ben
> >
> >
> > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Friday, 29 January 2016 3:46 AM
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] Any changes to rmaps in 1.10.2?
> >
> > I'm unaware of any change that would impact you here.
For some reason, mpirun believes you are requesting multiple cpus-per-proc, and that seems to be the heart of the problem. Is there an MCA parameter in your environment or default param file, perhaps?
> >
> >
> > On Wed, Jan 27, 2016 at 2:57 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
>
> Hi,
>
> Were there any changes to rmaps in going to 1.10.2? An otherwise-identical setup that worked in 1.10.0 fails to launch in 1.10.2, complaining that there's no CPUs available in a socket...
>
> With 1.10.0:
>
> $ /apps/openmpi/1.10.0/bin/mpirun -np 2 -mca rmaps_base_verbose 1000 hostname
> [r47:18709] mca: base: components_register: registering rmaps components
> [r47:18709] mca: base: components_register: found loaded component resilient
> [r47:18709] mca: base: components_register: component resilient register function successful
> [r47:18709] mca: base: components_register: found loaded component rank_file
> [r47:18709] mca: base: components_register: component rank_file register function successful
> [r47:18709] mca: base: components_register: found loaded component staged
> [r47:18709] mca: base: components_register: component staged has no register or open function
> [r47:18709] mca: base: components_register: found loaded component ppr
> [r47:18709] mca: base: components_register: component ppr register function successful
> [r47:18709] mca: base: components_register: found loaded component seq
> [r47:18709] mca: base: components_register: component seq register function successful
> [r47:18709] mca: base: components_register: found loaded component round_robin
> [r47:18709] mca: base: components_register: component round_robin register function successful
> [r47:18709] mca: base: components_register: found loaded component mindist
> [r47:18709] mca: base: components_register: component mindist register function successful
> [r47:18709] [[63529,0],0] rmaps:base set policy with core
> [r47:18709] mca: base: components_open: opening rmaps components
> [r47:18709] mca: base: components_open: found loaded component resilient
> [r47:18709] mca: base: components_open: component resilient open function successful
> [r47:18709] mca: base: components_open: found loaded component rank_file
> [r47:18709] mca: base: components_open: component rank_file open function successful
> [r47:18709] mca: base: components_open: found loaded component staged
> [r47:18709] mca: base: components_open: component staged open function successful
> [r47:18709] mca: base: components_open: found loaded component ppr
> [r47:18709] mca: base: components_open: component ppr open function successful
> [r47:18709] mca: base: components_open: found loaded component seq
> [r47:18709] mca: base: components_open: component seq open function successful
> [r47:18709] mca: base: components_open: found loaded component round_robin
> [r47:18709] mca: base: components_open: component round_robin open function successful
> [r47:18709] mca: base: components_open: found loaded component mindist
> [r47:18709] mca: base: components_open: component mindist open function successful
> [r47:18709] mca:rmaps:select: checking available component resilient
> [r47:18709] mca:rmaps:select: Querying component [resilient]
> [r47:18709] mca:rmaps:select: checking available component rank_file
> [r47:18709] mca:rmaps:select: Querying component [rank_file]
> [r47:18709] mca:rmaps:select: checking available component staged
> [r47:18709] mca:rmaps:select: Querying component [staged]
> [r47:18709] mca:rmaps:select: checking available component ppr
> [r47:18709] mca:rmaps:select: Querying component [ppr]
> [r47:18709] mca:rmaps:select: checking available component seq
> [r47:18709] mca:rmaps:select: Querying component [seq]
> [r47:18709] mca:rmaps:select: checking available component round_robin
> [r47:18709] mca:rmaps:select: Querying component [round_robin]
> [r47:18709] mca:rmaps:select: checking available component mindist
> [r47:18709] mca:rmaps:select: Querying component [mindist]
> [r47:18709] [[63529,0],0]: Final mapper priorities
> [r47:18709] Mapper: ppr Priority: 90
> [r47:18709] Mapper: seq Priority: 60
> [r47:18709] Mapper: resilient Priority: 40
> [r47:18709] Mapper: mindist Priority: 20
> [r47:18709] Mapper: round_robin Priority: 10
> [r47:18709] Mapper: staged Priority: 5
> [r47:18709] Mapper: rank_file Priority: 0
> [r47:18709] mca:rmaps: mapping job [63529,1]
> [r47:18709] mca:rmaps: creating new map for job [63529,1]
> [r47:18709] mca:rmaps: nprocs 2
> [r47:18709] mca:rmaps mapping given - using default
> [r47:18709] mca:rmaps:ppr: job [63529,1] not using ppr mapper
> [r47:18709] mca:rmaps:seq: job [63529,1] not using seq mapper
> [r47:18709] mca:rmaps:resilient: cannot perform initial map of job [63529,1] - no fault groups
> [r47:18709] mca:rmaps:mindist: job [63529,1] not using mindist mapper
> [r47:18709] mca:rmaps:rr: mapping job [63529,1]
> [r47:18709] AVAILABLE NODES FOR MAPPING:
> [r47:18709] node: r47 daemon: 0
> [r47:18709] node: r57 daemon: 1
> [r47:18709] node: r58 daemon: 2
> [r47:18709] node: r59 daemon: 3
> [r47:18709] mca:rmaps:rr: mapping no-span by Core for job [63529,1] slots 64 num_procs 2
> [r47:18709] mca:rmaps:rr: found 16 Core objects on node r47
> [r47:18709] mca:rmaps:rr: assigning proc to object 0
> [r47:18709] mca:rmaps:rr: assigning proc to object 1
> [r47:18709] mca:rmaps: computing ranks by core for job [63529,1]
> [r47:18709] mca:rmaps:rank_by: found 16 objects on node r47 with 2 procs
> [r47:18709] mca:rmaps:rank_by: assigned rank 0
> [r47:18709] mca:rmaps:rank_by: assigned rank 1
> [r47:18709] mca:rmaps:rank_by: found 16 objects on node r57 with 0 procs
> [r47:18709] mca:rmaps:rank_by: found 16 objects on node r58 with 0 procs
> [r47:18709] mca:rmaps:rank_by: found 16 objects on node r59 with 0 procs
> [r47:18709] mca:rmaps: compute bindings for job [63529,1] with policy CORE[4008]
> [r47:18709] mca:rmaps: bindings for job [63529,1] - bind in place
> [r47:18709] mca:rmaps: bind in place for job [63529,1] with bindings CORE
> [r47:18709] [[63529,0],0] reset_usage: node r47 has 2 procs on it
> [r47:18709] [[63529,0],0] reset_usage: ignoring proc [[63529,1],0]
> [r47:18709] [[63529,0],0] reset_usage: ignoring proc [[63529,1],1]
> [r47:18709] BINDING PROC [[63529,1],0] TO Core NUMBER 0
> [r47:18709] [[63529,0],0] BOUND PROC [[63529,1],0] TO 0[Core:0] on node r47
> [r47:18709] BINDING PROC [[63529,1],1] TO Core NUMBER 1
> [r47:18709] [[63529,0],0] BOUND PROC [[63529,1],1] TO 1[Core:1] on node r47
> r47
> r47
> [r47:18709] mca: base: close: component resilient closed
> [r47:18709] mca: base: close: unloading component resilient
> [r47:18709] mca: base: close: component rank_file closed
> [r47:18709] mca: base: close: unloading component rank_file
> [r47:18709] mca: base: close: component staged closed
> [r47:18709] mca: base: close: unloading component staged
> [r47:18709] mca: base: close: component ppr closed
> [r47:18709] mca: base: close: unloading component ppr
> [r47:18709] mca: base: close: component seq closed
> [r47:18709] mca: base: close: unloading component seq
> [r47:18709] mca: base: close: component round_robin closed
> [r47:18709] mca: base: close: unloading component round_robin
> [r47:18709] mca: base: close: component mindist closed
> [r47:18709] mca: base: close: unloading component mindist
>
> With 1.10.2:
>
> $ /apps/openmpi/1.10.2/bin/mpirun -np 2 -mca rmaps_base_verbose 1000 hostname
> [r47:18733] mca: base: components_register: registering rmaps components
> [r47:18733] mca: base: components_register: found loaded component resilient
> [r47:18733] mca: base: components_register: component resilient register function successful
> [r47:18733] mca: base: components_register: found loaded component rank_file
> [r47:18733] mca: base: components_register: component rank_file register function successful
> [r47:18733] mca: base: components_register: found loaded component staged
> [r47:18733] mca: base: components_register: component staged has no register or open function
> [r47:18733] mca: base: components_register: found loaded component ppr
> [r47:18733] mca: base: components_register: component ppr register function successful
> [r47:18733] mca: base: components_register: found loaded component seq
> [r47:18733] mca: base: components_register: component seq register function successful
> [r47:18733] mca: base: components_register: found loaded component round_robin
> [r47:18733] mca: base: components_register: component round_robin register function successful
> [r47:18733] mca: base: components_register: found loaded component mindist
> [r47:18733] mca: base: components_register: component mindist register function successful
> [r47:18733] [[63505,0],0] rmaps:base set policy with core
> [r47:18733] mca: base: components_open: opening rmaps components
> [r47:18733] mca: base: components_open: found loaded component resilient
> [r47:18733] mca: base: components_open: component resilient open function successful
> [r47:18733] mca: base: components_open: found loaded component rank_file
> [r47:18733] mca: base: components_open: component rank_file open function successful
> [r47:18733] mca: base: components_open: found loaded component staged
> [r47:18733] mca: base: components_open: component staged open function successful
> [r47:18733] mca: base: components_open: found loaded component ppr
> [r47:18733] mca: base: components_open: component ppr open function successful
> [r47:18733] mca: base: components_open: found loaded component seq
> [r47:18733] mca: base: components_open: component seq open function successful
> [r47:18733] mca: base: components_open: found loaded component round_robin
> [r47:18733] mca: base: components_open: component round_robin open function successful
> [r47:18733] mca: base: components_open: found loaded component mindist
> [r47:18733] mca: base: components_open: component mindist open function successful
> [r47:18733] mca:rmaps:select: checking available component resilient
> [r47:18733] mca:rmaps:select: Querying component [resilient]
> [r47:18733] mca:rmaps:select: checking available component rank_file
> [r47:18733] mca:rmaps:select: Querying component [rank_file]
> [r47:18733] mca:rmaps:select: checking available component staged
> [r47:18733] mca:rmaps:select: Querying component [staged]
> [r47:18733] mca:rmaps:select: checking available component ppr
> [r47:18733] mca:rmaps:select: Querying component [ppr]
> [r47:18733] mca:rmaps:select: checking available component seq
> [r47:18733] mca:rmaps:select: Querying component [seq]
> [r47:18733] mca:rmaps:select: checking available component round_robin
> [r47:18733] mca:rmaps:select: Querying component [round_robin]
> [r47:18733] mca:rmaps:select: checking available component mindist
> [r47:18733] mca:rmaps:select: Querying component [mindist]
> [r47:18733] [[63505,0],0]: Final mapper priorities
> [r47:18733] Mapper: ppr Priority: 90
> [r47:18733] Mapper: seq Priority: 60
> [r47:18733] Mapper: resilient Priority: 40
> [r47:18733] Mapper: mindist Priority: 20
> [r47:18733] Mapper: round_robin Priority: 10
> [r47:18733] Mapper: staged Priority: 5
> [r47:18733] Mapper: rank_file Priority: 0
> [r47:18733] mca:rmaps: mapping job [63505,1]
> [r47:18733] mca:rmaps: creating new map for job [63505,1]
> [r47:18733] mca:rmaps: nprocs 2
> [r47:18733] mca:rmaps mapping given - using default
> [r47:18733] mca:rmaps:ppr: job [63505,1] not using ppr mapper
> [r47:18733] mca:rmaps:seq: job [63505,1] not using seq mapper
> [r47:18733] mca:rmaps:resilient: cannot perform initial map of job [63505,1] - no fault groups
> [r47:18733] mca:rmaps:mindist: job [63505,1] not using mindist mapper
> [r47:18733] mca:rmaps:rr: mapping job [63505,1]
> [r47:18733] AVAILABLE NODES FOR MAPPING:
> [r47:18733] node: r47 daemon: 0
> [r47:18733] node: r57 daemon: 1
> [r47:18733] node: r58 daemon: 2
> [r47:18733] node: r59 daemon: 3
> [r47:18733] mca:rmaps:rr: mapping no-span by Core for job [63505,1] slots 64 num_procs 2
> [r47:18733] mca:rmaps:rr: found 16 Core objects on node r47
> [r47:18733] mca:rmaps:rr: assigning proc to object 0
> --------------------------------------------------------------------------
> A request for multiple cpus-per-proc was given, but a directive
> was also give to map to an object level that has less cpus than
> requested ones:
>
> #cpus-per-proc: 1
> number of cpus: 0
> map-by: BYCORE:NOOVERSUBSCRIBE
>
> Please specify a mapping level that has more cpus, or else let us
> define a default mapping that will allow multiple cpus-per-proc.
> --------------------------------------------------------------------------
> [r47:18733] mca: base: close: component resilient closed
> [r47:18733] mca: base: close: unloading component resilient
> [r47:18733] mca: base: close: component rank_file closed
> [r47:18733] mca: base: close: unloading component rank_file
> [r47:18733] mca: base: close: component staged closed
> [r47:18733] mca: base: close: unloading component staged
> [r47:18733] mca: base: close: component ppr closed
> [r47:18733] mca: base: close: unloading component ppr
> [r47:18733] mca: base: close: component seq closed
> [r47:18733] mca: base: close: unloading component seq
> [r47:18733] mca: base: close: component round_robin closed
> [r47:18733] mca: base: close: unloading component round_robin
> [r47:18733] mca: base: close: component mindist closed
> [r47:18733] mca: base: close: unloading component mindist
>
> These are both in the same PBS Pro job. And the cpuset definitely has all cores available:
>
> $ cat /cgroup/cpuset/pbspro/4347646.r-man2/cpuset.cpus
> 0-15
>
> Is there something here I'm missing?
>
> Cheers,
> Ben
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/01/28393.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/01/28397.php