Re: [OMPI users] [EXTERNAL] Invalid -L flag added to aprun

2024-07-11 Thread Borchert, Christopher B ERDC-RDE-ITL-MS CIV via users
Thanks Howard. Here is what I got.

batch35:/p/work/borchert> mpirun -n 1 -d ./a.out
[batch35:62735] procdir: /p/work/borchert/ompi.batch35.34110/pid.62735/0/0
[batch35:62735] jobdir: /p/work/borchert/ompi.batch35.34110/pid.62735/0
[batch35:62735] top: /p/work/borchert/ompi.batch35.34110/pid.62735
[batch35:62735] top: /p/work/borchert/ompi.batch35.34110
[batch35:62735] tmp: /p/work/borchert
[batch35:62735] sess_dir_cleanup: job session dir does not exist
[batch35:62735] sess_dir_cleanup: top session dir does not exist
[batch35:62735] procdir: /p/work/borchert/ompi.batch35.34110/pid.62735/0/0
[batch35:62735] jobdir: /p/work/borchert/ompi.batch35.34110/pid.62735/0
[batch35:62735] top: /p/work/borchert/ompi.batch35.34110/pid.62735
[batch35:62735] top: /p/work/borchert/ompi.batch35.34110
[batch35:62735] tmp: /p/work/borchert
[batch35:62735] mca: base: components_register: registering framework ras 
components
[batch35:62735] mca: base: components_register: found loaded component simulator
[batch35:62735] mca: base: components_register: component simulator register 
function successful
[batch35:62735] mca: base: components_register: found loaded component slurm
[batch35:62735] mca: base: components_register: component slurm register 
function successful
[batch35:62735] mca: base: components_register: found loaded component tm
[batch35:62735] mca: base: components_register: component tm register function 
successful
[batch35:62735] mca: base: components_register: found loaded component alps
[batch35:62735] mca: base: components_register: component alps register 
function successful
[batch35:62735] mca: base: components_open: opening ras components
[batch35:62735] mca: base: components_open: found loaded component simulator
[batch35:62735] mca: base: components_open: found loaded component slurm
[batch35:62735] mca: base: components_open: component slurm open function 
successful
[batch35:62735] mca: base: components_open: found loaded component tm
[batch35:62735] mca: base: components_open: component tm open function 
successful
[batch35:62735] mca: base: components_open: found loaded component alps
[batch35:62735] mca: base: components_open: component alps open function 
successful
[batch35:62735] mca:base:select: Auto-selecting ras components
[batch35:62735] mca:base:select:(  ras) Querying component [simulator]
[batch35:62735] mca:base:select:(  ras) Querying component [slurm]
[batch35:62735] mca:base:select:(  ras) Querying component [tm]
[batch35:62735] mca:base:select:(  ras) Query of component [tm] set priority to 
100
[batch35:62735] mca:base:select:(  ras) Querying component [alps]
[batch35:62735] ras:alps: available for selection
[batch35:62735] mca:base:select:(  ras) Query of component [alps] set priority 
to 75
[batch35:62735] mca:base:select:(  ras) Selected component [tm]
[batch35:62735] mca: base: close: unloading component simulator
[batch35:62735] mca: base: close: component slurm closed
[batch35:62735] mca: base: close: unloading component slurm
[batch35:62735] mca: base: close: unloading component alps
[batch35:62735] [[34694,0],0] ras:base:allocate
[batch35:62735] [[34694,0],0] ras:tm:allocate:discover: got hostname nid01243
[batch35:62735] [[34694,0],0] ras:tm:allocate:discover: not found -- added to 
list
[batch35:62735] [[34694,0],0] ras:tm:allocate:discover: got hostname nid01244
[batch35:62735] [[34694,0],0] ras:tm:allocate:discover: not found -- added to 
list
[batch35:62735] [[34694,0],0] ras:base:node_insert inserting 2 nodes
[batch35:62735] [[34694,0],0] ras:base:node_insert node nid01243 slots 1
[batch35:62735] [[34694,0],0] ras:base:node_insert node nid01244 slots 1

======================   ALLOCATED NODES   ======================
  nid01243: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UP
  nid01244: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================
[batch35:62735] plm:alps: final top-level argv:
[batch35:62735] plm:alps: aprun -n 2 -N 1 -cc none -e 
PMI_NO_PREINITIALIZE=1 -e PMI_NO_FORK=1 -e OMPI_NO_USE_CRAY_PMI=1 -L 
nid01243,nid01244 orted -mca orte_debug 1 -mca ess_base_jobid 2273705984 -mca 
ess_base_vpid 1 -mca ess_base_num_procs 3 -mca orte_node_regex 
batch[2:35],nid[5:1243-1244]@0(3) -mca orte_hnp_uri 
2273705984.0;tcp://10.128.8.181:56687
aprun: -L node_list contains an invalid entry

Usage: aprun [global_options] [command_options] cmd1
  [: [command_options] cmd2 [: ...] ]
  [--help] [--version]

--help Print this help information and exit
--version  Print version information
:  Separate binaries for MPMD mode
   (Multiple Program, Multiple Data)

Global Options:
-b, --bypass-app-transfer
Bypass application transfer to compute node
-B, --batch-args
Get values from Batch reservation for -n, -N, -d, and -m
-C, --reconnect
Reconnect fanout control tree around failed nodes
-D, --debug level
Debug level bitmask (0-7)
-e

Re: [OMPI users] [EXTERNAL] Invalid -L flag added to aprun

2024-07-11 Thread Pritchard Jr., Howard via users
Hi Chris

I wonder if something's messed up with the way alps is interpreting node names 
on the system.

Could you try doing the following:

1. get a two-node allocation on your cluster
2. run aprun -n 2 -N 1 hostname
3. take the hostnames returned, then run aprun -n 2 -N 1 -L X,Y hostname,
where X is the first hostname returned in step 2 and Y is the second (a
consolidated sketch follows below)
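
A minimal end-to-end sketch of the test above, with hypothetical hostnames
standing in for whatever step 2 actually prints on your system:

    aprun -n 2 -N 1 hostname                        # step 2: suppose this prints nid01243 and nid01244
    aprun -n 2 -N 1 -L nid01243,nid01244 hostname   # step 3: re-run pinned to those two names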


Re: [OMPI users] [EXTERNAL] Invalid -L flag added to aprun

2024-07-11 Thread Borchert, Christopher B ERDC-RDE-ITL-MS CIV via users
It’s the same output and the same result:

batch13:~> aprun -n 2 -N 1 hostname
nid00418
nid00419

batch13:~> aprun -n 2 -N 1 -L nid00418,nid00419 hostname
aprun: -L node_list contains an invalid entry
Usage: aprun [global_options] [command_options] cmd1
...

Thanks,
Chris
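
If aprun on this system follows the usual ALPS convention, the -L node_list
wants numeric node IDs (NIDs) rather than hostnames, which would explain the
rejection above. A hedged sketch of the numeric form, assuming the NIDs are
simply the numeric part of the nid hostnames:

    aprun -n 2 -N 1 -L 418,419 hostname   # assumption: NIDs 418 and 419 correspond to nid00418/nid00419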


Re: [OMPI users] [EXTERNAL] Invalid -L flag added to aprun

2024-07-11 Thread Pritchard Jr., Howard via users
Okay, try setting this environment variable and see if the mpirun command works:

export OMPI_MCA_ras=alps
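
For a one-off run, the same component selection can be made on the mpirun
command line instead of via the environment; a sketch using Open MPI's
standard --mca syntax:

    mpirun --mca ras alps -n 1 -d ./a.out   # per-invocation equivalent of exporting OMPI_MCA_ras=alps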



Re: [OMPI users] [EXTERNAL] Invalid -L flag added to aprun

2024-07-11 Thread Borchert, Christopher B ERDC-RDE-ITL-MS CIV via users
That did it! Thanks Howard!


Re: [OMPI users] [EXTERNAL] Invalid -L flag added to aprun

2024-07-11 Thread Pritchard Jr., Howard via users
Okay. Something must have broken between 4.0.x and 4.1.x to give the PBS Pro 
(tm) RAS priority over the alps one even on Cray XC systems.
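
If forcing the alps RAS is going to be the standing workaround, it can also be
set in the per-user MCA parameter file so every mpirun picks it up; a sketch
assuming Open MPI's default per-user config location:

    # $HOME/.openmpi/mca-params.conf
    ras = alps

Running ompi_info --param ras all --level 9 should list the RAS components and
their priority parameters for confirmation.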
