[OMPI users] Slurm or OpenMPI error?

2024-07-01 Thread Mccall, Kurt E. (MSFC-EV41) via users
Using OpenMPI 5.0.3 and Slurm 20.11.8.

Is this error message issued by Slurm or by OpenMPI?  A Google search on the error message yielded nothing.

--
At least one of the requested hosts is not included in the current
allocation.

   Missing requested host: n001^X

Please check your allocation or your request.
--



Following that error, MPI_Comm_spawn failed on the named node, n001.


[n001:0] *** An error occurred in MPI_Comm_spawn
[n001:0] *** reported by process [59578,0]
[n001:0] *** on communicator MPI_COMM_SELF
[n001:0] *** MPI_ERR_UNKNOWN: unknown error
[n001:0] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n001:0] *** and MPI will try to terminate your MPI job as well)
1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal

Thanks,
Kurt


Re: [OMPI users] [EXTERNAL] Slurm or OpenMPI error?

2024-07-01 Thread Pritchard Jr., Howard via users
Hello Kurt,

The host name looks a little odd.  Do you by chance have a reproducer and 
instructions on how you’re running it that we could try?
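
If a reproducer is hard to extract, one thing you could try in the meantime is switching MPI_COMM_SELF to MPI_ERRORS_RETURN, so the spawn failure comes back as a return code with an error string instead of aborting. A rough sketch (the "./worker" command is just a placeholder, not from your code):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Return the error instead of aborting, then print the
           implementation's error string for the failed spawn. */
        MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

        MPI_Comm intercomm;
        int rc = MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                                0, MPI_COMM_SELF, &intercomm,
                                MPI_ERRCODES_IGNORE);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "MPI_Comm_spawn failed: %s\n", msg);
        }

        MPI_Finalize();
        return 0;
    }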

Howard



Re: [OMPI users] [EXTERNAL] Slurm or OpenMPI error?

2024-07-01 Thread Mccall, Kurt E. (MSFC-EV41) via users
Howard,

I don’t know where that ^X following the hostname came from.  The node is definitely named n001.  I will try to create a reproducer.
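
Here is roughly the shape the reproducer will take: a minimal spawn that pins the child to n001 through the "host" info key (the "./worker" path is a placeholder). One hunch about the ^X: if the host string is built in a buffer that never gets its terminating NUL, whatever byte follows it (0x18 prints as ^X) rides along to the launcher and fails to match the allocation.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* The "host" value must be a clean, NUL-terminated string; a
           buffer missing its terminator hands the launcher a stray
           trailing byte as part of the host name. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "n001");

        MPI_Comm intercomm;
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }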

Thanks,
Kurt



Re: [OMPI users] [EXTERNAL] Slurm or OpenMPI error?

2024-07-01 Thread Mccall, Kurt E. (MSFC-EV41) via users
Howard,

I should note that this code ran fine up to the point that our sysadmins updated something on the cluster. That makes me think it is a configuration issue, and that running my reproducer wouldn’t give you any insight: it would succeed for you and still fail for me.

What do you think?  I’ll try to get some info from the sysadmins about what they changed.
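
In the meantime, a quick check inside an allocation should show whether a stray byte is sneaking into the node list itself (cat -v makes control characters such as ^X visible):

    printenv SLURM_JOB_NODELIST | cat -v
    scontrol show hostnames "$SLURM_JOB_NODELIST"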

Thanks,
Kurt



[OMPI users] Invalid -L flag added to aprun

2024-07-01 Thread Borchert, Christopher B ERDC-RDE-ITL-MS CIV via users
On a Cray XC (which requires the aprun launcher to get from the batch node to the compute nodes), 4.0.5 works but 4.1.1 and 4.1.6 do not (even on a single node). The newer versions throw this:
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--

On all three versions, adding -d to mpirun shows that aprun is being called. However, the two newer versions add an invalid flag: -L. It doesn't matter whether the -L is followed by a batch node name or a compute node name.

4.0.5:
[batch7:78642] plm:alps: aprun -n 1 -N 1 -cc none -e
PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1
orted -mca orte_debug 1 -mca ess_base_jobid 3787849728 -mca ess_base_vpid 1
-mca ess_base_num_procs 2 -mca orte_node_regex batch[1:7],[3:132]@0(2) -mca
orte_hnp_uri 3787849728.0;tcp://10.128.13.251:34149

4.1.1:
[batch7:75094] plm:alps: aprun -n 1 -N 1 -cc none -e
PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 -L batch7
orted -mca orte_debug 1 -mca ess_base_jobid 4154589184 -mca ess_base_vpid 1
-mca ess_base_num_procs 2 -mca orte_node_regex mpirun,batch[1:7]@0(2) -mca
orte_hnp_uri 4154589184.0;tcp://10.128.13.251:56589
aprun: -L node_list contains an invalid entry

4.1.6:
[batch20:43065] plm:alps: aprun -n 1 -N 1 -cc none -e
PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 -L
nid00140 orted -mca orte_debug 1 -mca ess_base_jobid 115474432 -mca
ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_node_regex
batch[2:20],nid[5:140]@0(2) -mca orte_hnp_uri
115474432.0;tcp://10.128.1.39:51455
aprun: -L node_list contains an invalid entry

How can I get this -L argument removed?

Thanks, Chris




Re: [OMPI users] [EXTERNAL] Invalid -L flag added to aprun

2024-07-01 Thread Pritchard Jr., Howard via users
Hi Chris,

First, a big caveat and disclaimer: I'm not sure any Open MPI developers still have access to Cray XC systems, so all I can do is make suggestions.

What's probably happening is that ORTE thinks it is going to fork off the application processes on the head node itself, and that isn't going to work on the XC's Aries network.
I'm not sure what changed in ORTE between 4.0.x and 4.1.x to cause this difference, but could you set the following ORTE MCA parameter and see if the problem goes away?

export OMPI_MCA_ras_base_launch_orted_on_hn=1
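
The same parameter can also be given directly on the mpirun command line, e.g. (with ./my_app standing in for your application and launch arguments):

    mpirun --mca ras_base_launch_orted_on_hn 1 -n 1 ./my_app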

What batch scheduler is your system using?

Howard
