That did it! Thanks Howard!

-----Original Message-----
From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Thursday, July 11, 2024 9:14 AM
To: Borchert, Christopher B ERDC-RDE-ITL-MS CIV <christopher.b.borch...@erdc.dren.mil>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [EXTERNAL] [OMPI users] Invalid -L flag added to aprun

Okay, try setting this environment variable and see if the mpirun command works:

export OMPI_MCA_ras=alps
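For reference, the same selection can be made in other ways: per run on the mpirun command line, or persistently through a per-user MCA parameter file. A minimal sketch, assuming a stock Open MPI 4.1.x install and the ./a.out test program used elsewhere in this thread:

    # one-off, for a single launch
    mpirun --mca ras alps -n 1 ./a.out

    # or persistently, by adding this line to $HOME/.openmpi/mca-params.conf
    ras = alps

All three forms set the same "ras" MCA parameter, forcing the alps allocator instead of the component that wins the default priority-based selection.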
On 7/11/24, 8:10 AM, "Borchert, Christopher B ERDC-RDE-ITL-MS CIV" <christopher.b.borch...@erdc.dren.mil> wrote:

It's the same output and the same result:

batch13:~> aprun -n 2 -N 1 hostname
nid00418
nid00419
batch13:~> aprun -n 2 -N 1 -L nid00418,nid00419 hostname
aprun: -L node_list contains an invalid entry
Usage: aprun [global_options] [command_options] cmd1 ...

Thanks,
Chris

-----Original Message-----
From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Thursday, July 11, 2024 9:03 AM
To: Borchert, Christopher B ERDC-RDE-ITL-MS CIV <christopher.b.borch...@erdc.dren.mil>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [EXTERNAL] [OMPI users] Invalid -L flag added to aprun

Hi Chris,

I wonder if something's messed up with the way ALPS is interpreting node names on the system. Could you try the following:

1. Get a two-node allocation on your cluster.
2. Run aprun -n 2 -N 1 hostname.
3. Take the hostnames returned, then run aprun -n 2 -N 1 -L X,Y hostname, where X is the first string returned by the command in step 2 and Y is the second.

On 7/11/24, 7:55 AM, "Borchert, Christopher B ERDC-RDE-ITL-MS CIV" <christopher.b.borch...@erdc.dren.mil> wrote:

Thanks Howard. Here is what I got.

batch35:/p/work/borchert> mpirun -n 1 -d ./a.out
[batch35:62735] procdir: /p/work/borchert/ompi.batch35.34110/pid.62735/0/0
[batch35:62735] jobdir: /p/work/borchert/ompi.batch35.34110/pid.62735/0
[batch35:62735] top: /p/work/borchert/ompi.batch35.34110/pid.62735
[batch35:62735] top: /p/work/borchert/ompi.batch35.34110
[batch35:62735] tmp: /p/work/borchert
[batch35:62735] sess_dir_cleanup: job session dir does not exist
[batch35:62735] sess_dir_cleanup: top session dir does not exist
[batch35:62735] procdir: /p/work/borchert/ompi.batch35.34110/pid.62735/0/0
[batch35:62735] jobdir: /p/work/borchert/ompi.batch35.34110/pid.62735/0
[batch35:62735] top: /p/work/borchert/ompi.batch35.34110/pid.62735
[batch35:62735] top: /p/work/borchert/ompi.batch35.34110
[batch35:62735] tmp: /p/work/borchert
[batch35:62735] mca: base: components_register: registering framework ras components
[batch35:62735] mca: base: components_register: found loaded component simulator
[batch35:62735] mca: base: components_register: component simulator register function successful
[batch35:62735] mca: base: components_register: found loaded component slurm
[batch35:62735] mca: base: components_register: component slurm register function successful
[batch35:62735] mca: base: components_register: found loaded component tm
[batch35:62735] mca: base: components_register: component tm register function successful
[batch35:62735] mca: base: components_register: found loaded component alps
[batch35:62735] mca: base: components_register: component alps register function successful
[batch35:62735] mca: base: components_open: opening ras components
[batch35:62735] mca: base: components_open: found loaded component simulator
[batch35:62735] mca: base: components_open: found loaded component slurm
[batch35:62735] mca: base: components_open: component slurm open function successful
[batch35:62735] mca: base: components_open: found loaded component tm
[batch35:62735] mca: base: components_open: component tm open function successful
[batch35:62735] mca: base: components_open: found loaded component alps
[batch35:62735] mca: base: components_open: component alps open function successful
[batch35:62735] mca:base:select: Auto-selecting ras components
[batch35:62735] mca:base:select:( ras) Querying component [simulator]
[batch35:62735] mca:base:select:( ras) Querying component [slurm]
[batch35:62735] mca:base:select:( ras) Querying component [tm]
[batch35:62735] mca:base:select:( ras) Query of component [tm] set priority to 100
[batch35:62735] mca:base:select:( ras) Querying component [alps]
[batch35:62735] ras:alps: available for selection
[batch35:62735] mca:base:select:( ras) Query of component [alps] set priority to 75
[batch35:62735] mca:base:select:( ras) Selected component [tm]
[batch35:62735] mca: base: close: unloading component simulator
[batch35:62735] mca: base: close: component slurm closed
[batch35:62735] mca: base: close: unloading component slurm
[batch35:62735] mca: base: close: unloading component alps
[batch35:62735] [[34694,0],0] ras:base:allocate
[batch35:62735] [[34694,0],0] ras:tm:allocate:discover: got hostname nid01243
[batch35:62735] [[34694,0],0] ras:tm:allocate:discover: not found -- added to list
[batch35:62735] [[34694,0],0] ras:tm:allocate:discover: got hostname nid01244
[batch35:62735] [[34694,0],0] ras:tm:allocate:discover: not found -- added to list
[batch35:62735] [[34694,0],0] ras:base:node_insert inserting 2 nodes
[batch35:62735] [[34694,0],0] ras:base:node_insert node nid01243 slots 1
[batch35:62735] [[34694,0],0] ras:base:node_insert node nid01244 slots 1

======================   ALLOCATED NODES   ======================
    nid01243: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UP
    nid01244: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================

[batch35:62735] plm:alps: final top-level argv:
[batch35:62735] plm:alps: aprun -n 2 -N 1 -cc none -e PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 -L nid01243,nid01244 orted -mca orte_debug 1 -mca ess_base_jobid 2273705984 -mca ess_base_vpid 1 -mca ess_base_num_procs 3 -mca orte_node_regex batch[2:35],nid[5:1243-1244]@0(3) -mca orte_hnp_uri 2273705984.0;tcp://10.128.8.181:56687
aprun: -L node_list contains an invalid entry
Usage: aprun [global_options] [command_options] cmd1 [: [command_options] cmd2 [: ...] ] [--help] [--version]

  --help        Print this help information and exit
  --version     Print version information
  :             Separate binaries for MPMD mode (Multiple Program, Multiple Data)

Global Options:
  -b, --bypass-app-transfer        Bypass application transfer to compute node
  -B, --batch-args                 Get values from Batch reservation for -n, -N, -d, and -m
  -C, --reconnect                  Reconnect fanout control tree around failed nodes
  -D, --debug level                Debug level bitmask (0-7)
  -e, --environment-override env   Set an environment variable on the compute nodes
                                   Must use format VARNAME=value
                                   Set multiple env variables using multiple -e args
  -P, --pipes pipes                Write[,read] pipes (not applicable for general use)
  -p, --protection-domain pdi      Protection domain identifier
  -q, --quiet                      Quiet mode; suppress aprun non-fatal messages
  -R, --relaunch max_shrink        Relaunch application; max_shrink is zero or more maximum PEs to shrink for a relaunch
  -T, --sync-output                Use synchronous TTY
  -t, --cpu-time-limit sec         Per PE CPU time limit in seconds (default unlimited)
  --wdir wdir                      Application working directory (default current directory)
  -Z, --zone-sort-secs secs        Perform periodic memory zone sort every secs seconds
  -z, --zone-sort                  Perform memory zone sort before application launch

Command Options:
  -a, --architecture arch          Architecture type (only XT currently supported)
  --cc, --cpu-binding cpu_list     CPU binding list or keyword ([cpu#[,cpu# | cpu1-cpu2] | x]...] | keyword)
  --cp, --cpu-binding-file file    CPU binding placement filename
  -d, --cpus-per-pe depth          Number of CPUs allocated per PE (number of threads)
  -E, --exclude-node-list node_list          List of nodes to exclude from placement
  --exclude-node-list-file node_list_file    File with a list of nodes to exclude from placement
  -F, --access-mode flag           Exclusive or share node resources flag
  -j, --cpus-per-cu CPUs           CPUs to use per Compute Unit (CU)
  -L, --node-list node_list        Manual placement list (node[,node | node1-node2]...)
  -l, --node-list-file node_list_file        File with manual placement list
  -m, --memory-per-pe size         Per PE memory limit in megabytes (default node memory/number of processors)
                                   K|M|G suffix supported (16 == 16M == 16 megabytes)
                                   Add an 'h' suffix to request per PE huge page memory
                                   Add an 's' to the 'h' suffix to make the per PE huge page memory size strict (required)
  --mpmd-env env                   Set an environment variable on the compute nodes for a specific MPMD command
                                   Must use format VARNAME=value
                                   Set multiple env variables using multiple --mpmd-env args
  -N, --pes-per-node pes           PEs per node
  -n, --pes width                  Number of PEs requested
  --p-governor governor_name       Specify application performance governor
  --p-state pstate                 Specify application p-state in kHz
  -r, --specialized-cpus CPUs      Restrict this many CPUs per node to specialization
  -S, --pes-per-numa-node pes      PEs per NUMA node
  --ss, --strict-memory-containment         Strict memory containment per NUMA node

[batch35:62735] [[34694,0],0]:errmgr_default_hnp.c(212) updating exit status to 1
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of
factors, including an inability to create a connection back to
mpirun due to a lack of common network interfaces and/or no route
found between them. Please check network connectivity (including
firewalls and network routing requirements).
--------------------------------------------------------------------------
[batch35:62735] [[34694,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_HALT_VM_CMD
[batch35:62735] Job UNKNOWN has launched
[batch35:62735] [[34694,0],0] Releasing job data for [34694,1]
[batch35:62735] [[34694,0],0] ras:tm:finalize: success (nothing to do)
[batch35:62735] mca: base: close: unloading component tm
[batch35:62735] sess_dir_finalize: proc session dir does not exist
[batch35:62735] sess_dir_finalize: job session dir does not exist
[batch35:62735] sess_dir_finalize: jobfam session dir does not exist
[batch35:62735] sess_dir_finalize: jobfam session dir does not exist
[batch35:62735] sess_dir_finalize: top session dir does not exist
[batch35:62735] sess_dir_cleanup: job session dir does not exist
[batch35:62735] sess_dir_cleanup: top session dir does not exist
[batch35:62735] [[34694,0],0] Releasing job data for [34694,0]
[batch35:62735] sess_dir_cleanup: job session dir does not exist
[batch35:62735] sess_dir_cleanup: top session dir does not exist
exiting with status 1

Chris

-----Original Message-----
From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Wednesday, July 10, 2024 12:40 PM
To: Borchert, Christopher B ERDC-RDE-ITL-MS CIV <christopher.b.borch...@erdc.dren.mil>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [EXTERNAL] [OMPI users] Invalid -L flag added to aprun

Hi Chris,

Sorry for the delay. I wanted to first double-check with my own build of the 4.1.x branch. Could you try two things? First, if you didn't configure Open MPI with the --enable-debug option, could you do that and rebuild? Then try setting these environment variables and rerunning your test to see if we learn more:

export OMPI_MCA_ras_base_verbose=100
export OMPI_MCA_ras_base_launch_orted_on_hn=1
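For the rebuild step, a minimal sketch of the sequence, assuming a 4.1.x source tree, a hypothetical install prefix, and whatever Cray- or site-specific configure options are already in use at the site:

    # reconfigure with debug support, then rebuild and reinstall
    ./configure --prefix=$HOME/openmpi-4.1-debug --enable-debug
    make -j 8 all install

    # rerun the failing case with the extra RAS verbosity enabled
    export OMPI_MCA_ras_base_verbose=100
    export OMPI_MCA_ras_base_launch_orted_on_hn=1
    mpirun -n 1 -d ./a.out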
On 7/2/24, 8:09 AM, "Borchert, Christopher B ERDC-RDE-ITL-MS CIV" <christopher.b.borch...@erdc.dren.mil> wrote:

Thanks Howard. I don't find the env var changes the behavior. I'm using PBS Pro.

Chris

-----Original Message-----
From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Monday, July 1, 2024 3:43 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Borchert, Christopher B ERDC-RDE-ITL-MS CIV <christopher.b.borch...@erdc.dren.mil>
Subject: Re: [EXTERNAL] [OMPI users] Invalid -L flag added to aprun

Hi Christoph,

First, a big caveat and disclaimer: I'm not sure any Open MPI developers still have access to Cray XC systems, so all I can do is make suggestions.

What's probably happening is that ORTE thinks it is going to fork off the application processes on the head node itself. That isn't going to work for the XC Aries network. I'm not sure what would have changed between the ORTE in 4.0.x and 4.1.x to cause this difference, but could you set the following ORTE MCA parameter and see if the problem goes away?

export ORTE_MCA_ras_base_launch_orted_on_hn=1

What batch scheduler is your system using?

Howard
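A related check that may help narrow this down: ompi_info can show which ras components a build actually contains and dump their parameters. A rough sketch, assuming the Open MPI 4.1.x install in question is first in PATH:

    # list the ras (resource allocation subsystem) components compiled into this build
    ompi_info | grep "MCA ras"

    # dump the ras framework's MCA parameters, including selection and priority knobs
    ompi_info --param ras all --level 9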
On 7/1/24, 2:11 PM, "users on behalf of Borchert, Christopher B ERDC-RDE-ITL-MS CIV via users" <users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> wrote:

On a Cray XC (which requires the aprun launcher to get from a batch node to the compute nodes), 4.0.5 works but 4.1.1 and 4.1.6 do not (even on a single node). The newer versions throw this:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of
factors, including an inability to create a connection back to
mpirun due to a lack of common network interfaces and/or no route
found between them. Please check network connectivity (including
firewalls and network routing requirements).
--------------------------------------------------------------------------

On all three versions, adding -d to mpirun shows that aprun is being called. However, the two newer versions add an invalid flag: -L. It doesn't matter whether -L is followed by a batch node name or a compute node name.

4.0.5:
[batch7:78642] plm:alps: aprun -n 1 -N 1 -cc none -e PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 orted -mca orte_debug 1 -mca ess_base_jobid 3787849728 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_node_regex batch[1:7],[3:132]@0(2) -mca orte_hnp_uri 3787849728.0;tcp://10.128.13.251:34149

4.1.1:
[batch7:75094] plm:alps: aprun -n 1 -N 1 -cc none -e PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 -L batch7 orted -mca orte_debug 1 -mca ess_base_jobid 4154589184 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_node_regex mpirun,batch[1:7]@0(2) -mca orte_hnp_uri 4154589184.0;tcp://10.128.13.251:56589
aprun: -L node_list contains an invalid entry

4.1.6:
[batch20:43065] plm:alps: aprun -n 1 -N 1 -cc none -e PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 -L nid00140 orted -mca orte_debug 1 -mca ess_base_jobid 115474432 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_node_regex batch[2:20],nid[5:140]@0(2) -mca orte_hnp_uri 115474432.0;tcp://10.128.1.39:51455
aprun: -L node_list contains an invalid entry

How can I get this -L argument removed?

Thanks,
Chris