Howard,

I should note that this code ran fine until our sysadmins updated something on the cluster. That makes me think it is a configuration issue, and that running my reproducer wouldn't give you any insight: it would succeed for you and still fail for me.
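
In the meantime, since the host name in the error ends with a stray control character ("n001^X", where ^X is 0x18), I'll also try a defensive trim of the host string before the spawn call. A rough sketch of what I mean (just an illustration; where the host name actually comes from in my code is omitted here):

    /* Sketch: strip trailing whitespace and control bytes (e.g. 0x18,
     * i.e. ^X) from a host name before handing it to MPI_Comm_spawn
     * via the standard "host" info key. */
    #include <ctype.h>
    #include <string.h>

    static void trim_host(char *h)
    {
        size_t n = strlen(h);
        while (n > 0 && (isspace((unsigned char)h[n - 1]) ||
                         iscntrl((unsigned char)h[n - 1])))
            h[--n] = '\0';
    }

    /* Usage, assuming `host` and `info` are set up elsewhere:
     *     trim_host(host);
     *     MPI_Info_set(info, "host", host);
     */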

What do you think? I'll try to get some info from the sysadmins about what they changed.

Thanks,
Kurt

From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Monday, July 1, 2024 11:03 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: Re: [EXTERNAL] [OMPI users] Slurm or OpenMPI error?

Hello Kurt,

The host name looks a little odd.  Do you by chance have a reproducer and 
instructions on how you’re running it that we could try?

Howard

From: users <users-boun...@lists.open-mpi.org> on behalf of "Mccall, Kurt E. (MSFC-EV41) via users" <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Monday, July 1, 2024 at 9:36 AM
To: "OpenMpi User List (users@lists.open-mpi.org)" <users@lists.open-mpi.org>
Cc: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mcc...@nasa.gov>
Subject: [EXTERNAL] [OMPI users] Slurm or OpenMPI error?

Using OpenMPI 5.0.3 and Slurm 20.11.8.

Is this error message issued by Slurm or by OpenMPI? A Google search on the error message yielded nothing.

--------------------------------------------------------------------------
At least one of the requested hosts is not included in the current
allocation.

   Missing requested host: n001^X

Please check your allocation or your request.
--------------------------------------------------------------------------



Following that error, MPI_Comm_spawn failed on the named node, n001.


[n001:00000] *** An error occurred in MPI_Comm_spawn
[n001:00000] *** reported by process [595787777,0]
[n001:00000] *** on communicator MPI_COMM_SELF
[n001:00000] *** MPI_ERR_UNKNOWN: unknown error
[n001:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n001:00000] ***    and MPI will try to terminate your MPI job as well)
^@1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
^@1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
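
For context, the spawn in question boils down to something like the following (a stripped-down sketch rather than the actual application code; the program spawns a copy of itself on the node named in the error, using MPI_COMM_SELF as in the message above):

    /* Minimal MPI_Comm_spawn test: the parent asks for one child on a
     * specific node via the "host" info key; the child just reports in.
     * This is a sketch, not the actual application code. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm parent;
        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);

        if (parent == MPI_COMM_NULL) {
            /* Parent: request one child process on node n001. */
            MPI_Info info;
            MPI_Info_create(&info);
            MPI_Info_set(info, "host", "n001");

            MPI_Comm child;
            int errcode;
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, info, 0,
                           MPI_COMM_SELF, &child, &errcode);
            MPI_Info_free(&info);
        } else {
            printf("child running\n");   /* spawned copy */
        }

        MPI_Finalize();
        return 0;
    }

(Launched with a single parent process inside the Slurm allocation, e.g. mpiexec -n 1 ./spawn_test.)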

Thanks,
Kurt
