I am trying to launch a number of manager processes, one per node, and then
have each manager spawn a number of workers on its own node. For this example,
I have 2 managers and 2 workers per manager. I'm following the instructions at
this link

https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn

to force one manager process per node.


Here is my PBS/Torque qsub command:

$ qsub -V -j oe -e ./stdio -o ./stdio -f -X -N MyManagerJob -l nodes=2:ppn=3 MyManager.bash

I expect "-l nodes=2:ppn=3" to reserve 2 nodes with 3 slots on each (one slot
for the manager and the other two for the separately spawned workers). The
flag "-l" is a lower-case L, not the digit one.



Here is my mpiexec command within the MyManager.bash script.

mpiexec --enable-recovery --display-map --display-allocation \
    --mca mpi_param_check 1 --v --x DISPLAY --np 2 --map-by ppr:1:node MyManager.exe

I expect "--map-by ppr:1:node" to cause OpenMpi to launch exactly one manager 
on each node.



When the first worker is spawned via MPI_Comm_spawn(), OpenMpi reports:

======================   ALLOCATED NODES   ======================
        n002: flags=0x11 slots=3 max_slots=0 slots_inuse=3 state=UP
        n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
=================================================================
--------------------------------------------------------------------------
There are no allocated resources for the application:
  ./MyWorker
that match the requested mapping:
  -host: n001.cluster.com

Verify that you have mapped the allocated resources properly for the
indicated specification.
--------------------------------------------------------------------------
[n001:14883] *** An error occurred in MPI_Comm_spawn
[n001:14883] *** reported by process [1897594881,1]
[n001:14883] *** on communicator MPI_COMM_SELF
[n001:14883] *** MPI_ERR_SPAWN: could not spawn processes



The banner above clearly states that node n001 has 3 slots reserved and only
one slot in use at the time of the spawn. I am not sure why it reports that
there are no resources for the worker.
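
To make that slot arithmetic concrete, here is a small C++ sketch that parses
one line of the ALLOCATED NODES banner and computes the free slots. It is not
part of my application: the names NodeSlots, parse_node, and free_slots are
made up for illustration, and it assumes that free slots are simply
slots minus slots_inuse.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper type: one parsed line of the ALLOCATED NODES banner.
struct NodeSlots {
    std::string name;     // node name, e.g. "n001"
    int slots = 0;        // value of "slots="
    int slots_inuse = 0;  // value of "slots_inuse="
    bool ok = false;      // true if the line was parsed successfully
};

// Read the integer that follows " key=" in the line.
// The leading space keeps "slots" from matching inside "max_slots".
static bool find_int(const std::string& line, const std::string& key, int& out) {
    const std::string needle = " " + key + "=";
    const std::size_t pos = line.find(needle);
    if (pos == std::string::npos) return false;
    return std::sscanf(line.c_str() + pos + needle.size(), "%d", &out) == 1;
}

// Parse a banner line such as
//   "n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP"
NodeSlots parse_node(const std::string& line) {
    NodeSlots n;
    const std::size_t start = line.find_first_not_of(" \t");
    const std::size_t colon = line.find(':');
    if (start == std::string::npos || colon == std::string::npos || colon <= start) {
        return n;  // not a node line; n.ok stays false
    }
    n.name = line.substr(start, colon - start);
    n.ok = find_int(line, "slots", n.slots) &&
           find_int(line, "slots_inuse", n.slots_inuse);
    return n;
}

// Assumption: a node's free slots are slots minus slots_inuse.
int free_slots(const NodeSlots& n) {
    return n.slots - n.slots_inuse;
}
```

Applied to the two banner lines above, free_slots() gives 2 for n001 and 0 for
n002, so under that assumption n001 should have room for both of its workers.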

I've tried compiling OpenMpi 4.0 both with and without Torque support, and
I've tried using an explicit host file (or not), but the error is unchanged.
Any ideas?

My cluster is running CentOS 7.4 and I am using the Portland Group C++ compiler.
