Howard, I should note that this code ran fine until our sysadmins updated something on the cluster. That makes me think it is a configuration issue, so running my reproducer probably wouldn’t give you any insight: it would succeed for you and still fail for me.
What do you think? I’ll try to get some info from the sysadmins about what they changed.

Thanks,
Kurt

From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Monday, July 1, 2024 11:03 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: Re: [EXTERNAL] [OMPI users] Slurm or OpenMPI error?

Hello Kurt,

The host name looks a little odd. Do you by chance have a reproducer and instructions on how you’re running it that we could try?

Howard

From: users <users-boun...@lists.open-mpi.org> on behalf of "Mccall, Kurt E. (MSFC-EV41) via users" <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Monday, July 1, 2024 at 9:36 AM
To: "OpenMpi User List" <users@lists.open-mpi.org>
Cc: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mcc...@nasa.gov>
Subject: [EXTERNAL] [OMPI users] Slurm or OpenMPI error?

Using Open MPI 5.0.3 and Slurm 20.11.8. Is this error message issued by Slurm or by Open MPI? A Google search on the error message yielded nothing.

--------------------------------------------------------------------------
At least one of the requested hosts is not included in the current
allocation.

Missing requested host: n001^X

Please check your allocation or your request.
--------------------------------------------------------------------------

Following that error, MPI_Comm_spawn failed on the named node, n001.

[n001:00000] *** An error occurred in MPI_Comm_spawn
[n001:00000] *** reported by process [595787777,0]
[n001:00000] *** on communicator MPI_COMM_SELF
[n001:00000] *** MPI_ERR_UNKNOWN: unknown error
[n001:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n001:00000] ***    and MPI will try to terminate your MPI job as well)
^@1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
^@1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal

Thanks,
Kurt
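For context, here is a minimal sketch of the kind of spawn call that can trigger the "Missing requested host" message: the parent pins the child to a specific node through the "host" info key of MPI_Comm_spawn on MPI_COMM_SELF, matching the log above. This is only an assumption about what the reproducer does; the child binary "./child" and the use of MPI_Get_processor_name as the hostname source are placeholders, not Kurt's actual code.

/*
 * Minimal sketch (not the actual reproducer): the parent requests that
 * the spawned child run on a specific node via the "host" info key.
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Zero-fill the buffer so the host name is always null-terminated. */
    char host[MPI_MAX_PROCESSOR_NAME] = {0};
    int len = 0;
    MPI_Get_processor_name(host, &len);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host);  /* ask to spawn the child on this node */

    MPI_Comm intercomm;
    int errcodes[1];
    /* "./child" is a placeholder child binary. */
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_SELF, &intercomm, errcodes);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}

One hedged observation on the error text itself: if the host string handed to MPI_Info_set were not null-terminated, trailing garbage could end up in the host request, which might explain the stray ^X after n001 and the "odd" host name Howard noticed.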