Hi Chris,

Sorry for the delay. I wanted to double-check first with my own build of a 4.1.x branch.

Could you try again with two changes? First, if you didn't configure Open MPI with the --enable-debug option, could you do that and rebuild? Then set these environment variables and rerun your test to see if we learn more:

export OMPI_MCA_ras_base_verbose=100
export OMPI_MCA_ras_base_launch_orted_on_hn=1

On 7/2/24, 8:09 AM, "Borchert, Christopher B ERDC-RDE-ITL-MS CIV" <christopher.b.borch...@erdc.dren.mil> wrote:

Thanks Howard. I don't find the env var changes the behavior. I'm using PBS Pro.

Chris

-----Original Message-----
From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Monday, July 1, 2024 3:43 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Borchert, Christopher B ERDC-RDE-ITL-MS CIV <christopher.b.borch...@erdc.dren.mil>
Subject: Re: [EXTERNAL] [OMPI users] Invalid -L flag added to aprun

Hi Christoph,

First, a big caveat and disclaimer: I'm not sure if any Open MPI developers still have access to Cray XC systems, so all I can do is make suggestions.

What's probably happening is that orte thinks it is going to fork off the application processes on the head node itself. That isn't going to work for the XC Aries network. I'm not sure what would have changed between the orte in 4.0.x and 4.1.x to cause this difference, but could you set the following ORTE MCA parameter and see if this problem goes away?

export ORTE_MCA_ras_base_launch_orted_on_hn=1

What batch scheduler is your system using?
Howard

On 7/1/24, 2:11 PM, "users on behalf of Borchert, Christopher B ERDC-RDE-ITL-MS CIV via users" <users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> wrote:

On a Cray XC (which requires the aprun launcher to get from the batch node to a compute node), 4.0.5 works but 4.1.1 and 4.1.6 do not (even on a single node). The newer ones throw this:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun. This could be caused by a number of factors, including an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
--------------------------------------------------------------------------

On all three versions, when I add -d to mpirun, the output shows aprun being called. However, the two newer versions add an invalid flag: -L. It doesn't matter whether -L is followed by a batch node name or a compute node name.
4.0.5:
[batch7:78642] plm:alps: aprun -n 1 -N 1 -cc none -e PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 orted -mca orte_debug 1 -mca ess_base_jobid 3787849728 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_node_regex batch[1:7],[3:132]@0(2) -mca orte_hnp_uri 3787849728.0;tcp://10.128.13.251:34149

4.1.1:
[batch7:75094] plm:alps: aprun -n 1 -N 1 -cc none -e PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 -L batch7 orted -mca orte_debug 1 -mca ess_base_jobid 4154589184 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_node_regex mpirun,batch[1:7]@0(2) -mca orte_hnp_uri 4154589184.0;tcp://10.128.13.251:56589
aprun: -L node_list contains an invalid entry

4.1.6:
[batch20:43065] plm:alps: aprun -n 1 -N 1 -cc none -e PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 -L nid00140 orted -mca orte_debug 1 -mca ess_base_jobid 115474432 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_node_regex batch[2:20],nid[5:140]@0(2) -mca orte_hnp_uri 115474432.0;tcp://10.128.1.39:51455
aprun: -L node_list contains an invalid entry

How can I get this -L argument removed?

Thanks,
Chris
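
For comparing builds, it may help to pull just the -L argument out of the mpirun -d output rather than reading the full aprun line each time. The following is only a rough sketch: the function name aprun_L_args is made up for illustration and is not part of Open MPI, and it assumes the debug lines keep the "plm:alps: aprun" prefix shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): read mpirun debug output on
# stdin and print whatever node list was passed to aprun via -L, one per line.
# Prints nothing when no plm:alps line carries a -L flag (as with 4.0.5 above).
aprun_L_args() {
    # Keep only the launcher lines, then capture the token that follows " -L ".
    grep 'plm:alps: aprun' | sed -n 's/.* -L \([^ ]*\) .*/\1/p'
}
```

One possible way to use it is to pipe both output streams of the failing run through the function, for example `mpirun -d <your usual arguments> 2>&1 | aprun_L_args`. Empty output would suggest the launcher line has no -L flag.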