Hi,

I am evaluating Open MPI 5.0.0 and I am hitting what looks like a race condition when spawning a different number of processes on different nodes.
With this hostfile:

    $ cat hostfile
    node00
    node01
    node02
    node03

If I run this code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char* argv[]){
        MPI_Init(&argc, &argv);

        MPI_Comm intercomm;
        int final_nranks = 4, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Get_processor_name(name, &len);
        MPI_Comm_get_parent(&intercomm);

        if(intercomm == MPI_COMM_NULL){
            /* Parent: spawn one child per node listed in the hostfile */
            MPI_Info info;
            MPI_Info_create(&info);
            MPI_Info_set(info, "hostfile", "hostfile");
            MPI_Info_set(info, "map_by", "ppr:1:node");
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, final_nranks, info, 0,
                           MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
            printf("PARENT %s\n", name);
        } else {
            /* Child: started by MPI_Comm_spawn */
            printf("CHILD %s\n", name);
        }

        MPI_Finalize();
        return 0;
    }

With the command:

    $ mpirun -np 2 --hostfile hostfile --map-by node ./a.out

Sometimes I get the following, which is the placement I want, apart from the PMIX errors:

    [node00:281361] PMIX ERROR: ERROR in file prted/pmix/pmix_server_dyn.c at line 1034
    [node00:281361] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 1839
    PARENT node00
    CHILD node00
    PARENT node01
    CHILD node01
    CHILD node02
    CHILD node03

However, in other executions I get the following output:

    [node00:281468] PMIX ERROR: ERROR in file prted/pmix/pmix_server_dyn.c at line 1034
    [node00:281468] PMIX ERROR: OUT-OF-RESOURCE in file base/bfrop_base_unpack.c at line 1839
    --------------------------------------------------------------------------
    PRTE has lost communication with a remote daemon.

      HNP daemon : [prterun-node00-281468@0,0] on node node00
      Remote daemon: [prterun-node00-281468@0,2] on node node01

    This is usually due to either a failure of the TCP network
    connection to the node, or possibly an internal failure of
    the daemon itself. We cannot recover from this failure, and
    therefore will terminate the job.
    --------------------------------------------------------------------------
    [node00:00000] *** An error occurred in Socket closed
    [node00:00000] *** reported by process [3933011970,0]
    [node00:00000] *** on a NULL communicator
    [node00:00000] *** Unknown error
    [node00:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    [node00:00000] ***    and MPI will try to terminate your MPI job as well)

I have also submitted the issue on GitHub:
https://github.com/open-mpi/ompi/issues/11421

Any help is appreciated, even if it is only hints about which parts of the code I could hack on to track down the cause.

Thank you.