Then those flags are correct. I suspect mpirun is executing on n006, yes? The 
"location verified" flag just means that the daemon of rank N reported back from the 
node we expected it to be on - Slurm and Cray sometimes renumber the ranks. 
Torque doesn't, so you should never see a problem there. Since mpirun itself isn't 
launched, its node is never marked "verified", though I probably should 
alter that as it is obviously in the "right" place.
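
To make the two values concrete, here is a small decoding sketch (my illustration, not part of the original exchange) using the PRRTE_NODE_FLAG_* values quoted further down the thread - 0x11 is DAEMON_LAUNCHED | SLOTS_GIVEN, while 0x13 additionally carries LOC_VERIFIED:

    /* Illustrative only: decode the flag values shown in the
     * "ALLOCATED NODES" printout against the PRRTE_NODE_FLAG_* bits. */
    #include <stdio.h>

    #define PRRTE_NODE_FLAG_DAEMON_LAUNCHED 0x01
    #define PRRTE_NODE_FLAG_LOC_VERIFIED    0x02
    #define PRRTE_NODE_FLAG_OVERSUBSCRIBED  0x04
    #define PRRTE_NODE_FLAG_MAPPED          0x08
    #define PRRTE_NODE_FLAG_SLOTS_GIVEN     0x10
    #define PRRTE_NODE_NON_USABLE           0x20

    static void decode(unsigned flags)
    {
        printf("0x%02x =%s%s%s%s%s%s\n", flags,
               (flags & PRRTE_NODE_FLAG_DAEMON_LAUNCHED) ? " DAEMON_LAUNCHED" : "",
               (flags & PRRTE_NODE_FLAG_LOC_VERIFIED)    ? " LOC_VERIFIED"    : "",
               (flags & PRRTE_NODE_FLAG_OVERSUBSCRIBED)  ? " OVERSUBSCRIBED"  : "",
               (flags & PRRTE_NODE_FLAG_MAPPED)          ? " MAPPED"          : "",
               (flags & PRRTE_NODE_FLAG_SLOTS_GIVEN)     ? " SLOTS_GIVEN"     : "",
               (flags & PRRTE_NODE_NON_USABLE)           ? " NON_USABLE"      : "");
    }

    int main(void)
    {
        decode(0x11);   /* n006: DAEMON_LAUNCHED | SLOTS_GIVEN */
        decode(0x13);   /* n001-n005: DAEMON_LAUNCHED | LOC_VERIFIED | SLOTS_GIVEN */
        return 0;
    }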

I don't know what you mean by your app not behaving correctly on the remote 
nodes - my best guess is that perhaps some envar it needs isn't being forwarded?
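
One quick way to test the envar theory (a sketch of my own, not something from the thread; MY_APP_CONFIG is just a placeholder name) is to have every rank report whether the variable is visible on its node:

    /* Sketch: each rank prints the value of a hypothetical environment
     * variable so you can see whether it reached the remote nodes. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

        const char *val = getenv("MY_APP_CONFIG");   /* placeholder envar name */
        printf("rank %d on %s: MY_APP_CONFIG=%s\n", rank, host,
               val ? val : "(not set)");

        MPI_Finalize();
        return 0;
    }

With Open MPI, launching with something like "mpirun -x MY_APP_CONFIG ..." exports the variable from the mpirun environment to the launched processes.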


On Apr 14, 2020, at 2:04 AM, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov> wrote:

CentOS, Torque.
From: Ralph Castain <r...@open-mpi.org>
Sent: Monday, April 13, 2020 5:44 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

 What kind of system are you running on? Slurm? Cray? ...?
 

On Apr 13, 2020, at 3:11 PM, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov> wrote:
Thanks Ralph. So the difference between the working node's flag (0x11) and the 
non-working nodes’ flags (0x13) is the flag PRRTE_NODE_FLAG_LOC_VERIFIED. 
What does that imply? That the location of the daemon has NOT been verified?
Kurt
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Monday, April 13, 2020 4:47 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

 I updated the message to explain the flags (instead of a numerical value) for 
OMPI v5. In brief:
#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED   0x01   // whether or not the daemon on this node has been launched
#define PRRTE_NODE_FLAG_LOC_VERIFIED      0x02   // whether or not the location has been verified - used for environments where the daemon's final destination is uncertain
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED    0x04   // whether or not this node is oversubscribed
#define PRRTE_NODE_FLAG_MAPPED            0x08   // whether we have been added to the current map
#define PRRTE_NODE_FLAG_SLOTS_GIVEN       0x10   // the number of slots was specified - used only in non-managed environments
#define PRRTE_NODE_NON_USABLE             0x20   // the node is hosting a tool and is NOT to be used for jobs


On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org> wrote:
My application is behaving correctly on node n006, and incorrectly on the 
lower-numbered nodes. The flags in the error message below may give a clue as 
to why. What is the meaning of the flag values 0x11 and 0x13?
 ======================   ALLOCATED NODES   ======================
        n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
        n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
        n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
        n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
        n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
        n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
I’m using Open MPI 4.0.3.
 Thanks,
Kurt
