I am trying to launch a number of manager processes, one per node, and then have each manager spawn a number of workers on its own node. For this example, I have 2 managers and 2 workers per manager. I'm following the instructions at this link
https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn to force one manager process per node.

Here is my PBS/Torque qsub command:

    $ qsub -V -j oe -e ./stdio -o ./stdio -f -X -N MyManagerJob -l nodes=2:ppn=3 MyManager.bash

I expect "-l nodes=2:ppn=3" to reserve 2 nodes with 3 slots on each (one slot for the manager and the other two for the separately spawned workers). The option letter in "-l" is a lower-case L, not the digit one.

Here is my mpiexec command within the MyManager.bash script:

    mpiexec --enable-recovery --display-map --display-allocation --mca mpi_param_check 1 --v --x DISPLAY --np 2 --map-by ppr:1:node MyManager.exe

I expect "--map-by ppr:1:node" to cause Open MPI to launch exactly one manager on each node.

When the first worker is spawned via MPI_Comm_spawn(), Open MPI reports:

    ====================== ALLOCATED NODES ======================
        n002: flags=0x11 slots=3 max_slots=0 slots_inuse=3 state=UP
        n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    =================================================================
    --------------------------------------------------------------------------
    There are no allocated resources for the application:
      ./MyWorker
    that match the requested mapping:
      -host: n001.cluster.com

    Verify that you have mapped the allocated resources properly for the
    indicated specification.
    --------------------------------------------------------------------------
    [n001:14883] *** An error occurred in MPI_Comm_spawn
    [n001:14883] *** reported by process [1897594881,1]
    [n001:14883] *** on communicator MPI_COMM_SELF
    [n001:14883] *** MPI_ERR_SPAWN: could not spawn processes

The banner above clearly shows that node n001 has 3 slots reserved and only one slot in use at the time of the spawn, so I am not sure why Open MPI reports that there are no resources for it. I've tried compiling Open MPI 4.0 both with and without Torque support, and I've tried both with and without an explicit host file, but the error is unchanged. Any ideas?

My cluster is running CentOS 7.4 and I am using the Portland Group C++ compiler.
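
To spell out the slot arithmetic I am relying on, here is a minimal sketch (not part of my actual program) that reads one node line from the allocation banner and computes the free slots as slots minus slots_inuse. The names NodeSlots, read_field and parse_node_line are made up for illustration, and the parsing assumes the banner line format shown above. Whether Open MPI applies the same arithmetic when it maps a spawn is exactly what I am unsure about.

```cpp
#include <exception>
#include <string>

// Slot counts for one node, as printed in the "ALLOCATED NODES" banner.
// Hypothetical helper type, used only for this illustration.
struct NodeSlots {
    std::string name;
    int slots;
    int in_use;
    // Slots I expect to be available for spawned workers.
    int free_slots() const { return slots - in_use; }
};

// Read the integer following " key=" in a banner line.
// The leading space keeps "slots" from matching inside "max_slots".
static bool read_field(const std::string& line, const std::string& key, int& out) {
    const std::string needle = " " + key + "=";
    const std::string::size_type pos = line.find(needle);
    if (pos == std::string::npos) return false;
    try {
        out = std::stoi(line.substr(pos + needle.size()));
    } catch (const std::exception&) {
        return false;  // value was missing or not a number
    }
    return true;
}

// Parse a line such as
//   "n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP"
// Returns false if the line does not look like a node entry.
bool parse_node_line(const std::string& line, NodeSlots& out) {
    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos) return false;
    int slots = 0;
    int in_use = 0;
    if (!read_field(line, "slots", slots)) return false;
    if (!read_field(line, "slots_inuse", in_use)) return false;
    const std::string::size_type start = line.find_first_not_of(" \t");
    out.name = line.substr(start, colon - start);
    out.slots = slots;
    out.in_use = in_use;
    return true;
}
```

Applied to the two node lines in the banner, this gives 2 free slots on n001 (3 reserved, 1 in use) and 0 free slots on n002 (3 reserved, 3 in use). Two free slots on n001 is what I would expect to need for the two workers spawned by the manager on that node.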