Hi,

MPI_Comm_spawn() is failing with the error message "All nodes which are 
allocated for this job are already filled".  I compiled Open MPI 4.0.1 with the 
Portland Group C++ compiler, v. 19.5.0, both with and without Torque/Maui 
support.  I thought that building without Torque/Maui support would give me finer 
control over where MPI_Comm_spawn() places the processes, but the failure 
message was the same in either case.  Perhaps Torque is interfering with 
process creation somehow?

For the pared-down test code, I am following the instructions here to make 
mpiexec create exactly one manager process on a remote node and then have that 
manager spawn one worker process on the same node:

https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn




Here is the full error message.  Note the "Max slots: 0" entry in the second job map:

Data for JOB [39020,1] offset 0 Total slots allocated 22

========================   JOB MAP   ========================

Data for node: n001    Num slots: 2    Max slots: 2    Num procs: 1
        Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: N/A

=============================================================
Data for JOB [39020,1] offset 0 Total slots allocated 22

========================   JOB MAP   ========================

Data for node: n001    Num slots: 2    Max slots: 0    Num procs: 1
        Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]

=============================================================
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
[n001:08114] *** An error occurred in MPI_Comm_spawn
[n001:08114] *** reported by process [2557214721,0]
[n001:08114] *** on communicator MPI_COMM_SELF
[n001:08114] *** MPI_ERR_SPAWN: could not spawn processes
[n001:08114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n001:08114] ***    and potentially your MPI job)




Here is my mpiexec command:

mpiexec --display-map --v --x DISPLAY -hostfile MyNodeFile --np 1 -map-by ppr:1:node SpawnTestManager
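
For reference, I believe the same placement can also be requested without a hostfile by 
giving the slot count directly on the --host option; I haven't verified that it behaves 
any differently:

mpiexec --display-map --np 1 --host n001.cluster.com:2 -map-by ppr:1:node SpawnTestManager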




Here is my hostfile "MyNodeFile":

n001.cluster.com slots=2 max_slots=2
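
In case the spawned worker needs a free slot of its own, I may also try a hostfile with 
more slots, e.g.

n001.cluster.com slots=4 max_slots=4

though I don't know whether that addresses the "Max slots: 0" shown in the second job map above.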




Here is my SpawnTestManager code:

<code>
#include <iostream>
#include <string>
#include <cstdio>

#ifdef SUCCESS
#undef SUCCESS
#endif
#include "/opt/openmpi_pgc_tm/include/mpi.h"

using std::string;
using std::cout;
using std::endl;

int main(int argc, char *argv[])
{
    int rank, world_size;
    char *argv2[2];
    MPI_Comm mpi_comm;
    MPI_Info info;
    char host[MPI_MAX_PROCESSOR_NAME + 1];
    int host_name_len;

    string worker_cmd = "SpawnTestWorker";
    string host_name = "n001.cluster.com";

    argv2[0] = (char *) "dummy_arg";   // argv for the spawned worker, NULL-terminated
    argv2[1] = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    MPI_Get_processor_name(host, &host_name_len);
    cout << "Host name from MPI_Get_processor_name is " << host << endl;

    char info_str[64];
    sprintf(info_str, "ppr:%d:node", 1);
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host_name.c_str());   // place the worker on this host
    MPI_Info_set(info, "map-by", info_str);          // one process per node

    // With -np 1 this process is rank 0, which is also the root rank in MPI_COMM_SELF.
    MPI_Comm_spawn(worker_cmd.c_str(), argv2, 1, info, rank, MPI_COMM_SELF,
        &mpi_comm, MPI_ERRCODES_IGNORE);
    MPI_Comm_set_errhandler(mpi_comm, MPI::ERRORS_THROW_EXCEPTIONS);

    std::cout << "Manager success!" << std::endl;

    MPI_Finalize();
    return 0;
}
</code>
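
In case it is useful, here is an untested sketch of the error handling I plan to add so 
the spawn failure is reported instead of aborting.  It only uses the standard 
MPI_ERRORS_RETURN handler and MPI_Error_string; the helper name try_spawn is just for 
illustration:

<code>
#include <cstdio>
#include <mpi.h>

// Untested sketch: surface the MPI_Comm_spawn error code instead of aborting.
int try_spawn(const char *cmd, char *spawn_argv[], MPI_Info info, MPI_Comm *intercomm)
{
    int spawn_err = MPI_SUCCESS;

    // Return errors from calls on MPI_COMM_SELF instead of hitting MPI_ERRORS_ARE_FATAL.
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

    int rc = MPI_Comm_spawn(cmd, spawn_argv, 1, info, 0, MPI_COMM_SELF,
                            intercomm, &spawn_err);

    if (rc != MPI_SUCCESS || spawn_err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(rc != MPI_SUCCESS ? rc : spawn_err, msg, &len);
        std::fprintf(stderr, "MPI_Comm_spawn failed: %s\n", msg);
    }
    return rc;
}
</code>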



Here is my SpawnTestWorker code:

<code>
#include "/opt/openmpi_pgc_tm/include/mpi.h"
#include <iostream>

int main(int argc, char *argv[])
{
    int world_size, rank;
    MPI_Comm manager_intercom;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    MPI_Comm_get_parent(&manager_intercom);
    MPI_Comm_set_errhandler(manager_intercom, MPI::ERRORS_THROW_EXCEPTIONS);

    std::cout << "Worker success!" << std::endl;

    MPI_Finalize();
    return 0;
}
</code>
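
Once the spawn works, I was also going to have the worker report where it actually 
landed, so the placement can be compared against the "host" info key.  A small sketch 
(not yet wired in):

<code>
#include <mpi.h>
#include <iostream>

// Sketch: print the host the spawned worker is actually running on.
void report_worker_host(int rank)
{
    char host[MPI_MAX_PROCESSOR_NAME + 1];
    int len = 0;
    MPI_Get_processor_name(host, &len);
    std::cout << "Worker rank " << rank << " is running on " << host << std::endl;
}
</code>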

My config.log can be found here:  
https://gist.github.com/kmccall882/e26bc2ea58c9328162e8959b614a6fce.js

I've attached the other info requested on the help page, except the output 
of "ompi_info -v ompi full --parsable".  My version of ompi_info doesn't 
accept the "ompi full" arguments, and the "-all" arg doesn't produce much 
output.

Thanks for your help,
Kurt








