Darn, I was hoping the flags would give a clue to the malfunction, which I’ve
been trying to solve for weeks. MPI_Comm_spawn() correctly spawns a worker on
the node where mpirun is executing, but on the other nodes it fails with the following error:
****
There are no allocated resources for the application:
/home/kmccall/mav/9.15_mpi/mav
that match the requested mapping:
-host: n002.cluster.com:3
Verify that you have mapped the allocated resources properly for the
indicated specification.
[n002:08645] *** An error occurred in MPI_Comm_spawn
[n002:08645] *** reported by process [1225916417,4]
[n002:08645] *** on communicator MPI_COMM_SELF
[n002:08645] *** MPI_ERR_SPAWN: could not spawn processes
[n002:08645] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n002:08645] ***    and potentially your MPI job)
As you suggested several weeks ago, I added a process count to the host name
(n001.cluster.com:3) but it didn’t help. Here is how I set up the “info”
argument to MPI_Comm_spawn to spawn a single worker:
char info_str[64], host_str[64];
sprintf(info_str, "ppr:%d:node", 1);
sprintf(host_str, "%s:%d", host_name_.c_str(), 3);   // added ":3" to host name
MPI_Info_create(&info);
MPI_Info_set(info, "host", host_str);
MPI_Info_set(info, "map-by", info_str);
MPI_Info_set(info, "ompi_non_mpi", "true");
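For reference, here is a minimal, self-contained sketch of the full spawn call that this snippet feeds into; the surrounding main() is only illustrative, not my actual code, and the worker path and host name are taken from the error output above:

#include <mpi.h>
#include <stdio.h>

/* Sketch: spawn one worker on a named host via MPI_Comm_spawn,
   mirroring the info keys set in the snippet above. */
static void spawn_one_worker(const char *worker_path, const char *host_name)
{
    char host_str[64];
    MPI_Info info;
    MPI_Comm intercomm;

    /* host name plus slot count, e.g. "n002.cluster.com:3" */
    snprintf(host_str, sizeof(host_str), "%s:%d", host_name, 3);

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host_str);
    MPI_Info_set(info, "map-by", "ppr:1:node");   /* one process per node */
    MPI_Info_set(info, "ompi_non_mpi", "true");

    MPI_Comm_spawn((char *)worker_path, MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    spawn_one_worker("/home/kmccall/mav/9.15_mpi/mav", "n002.cluster.com");
    MPI_Finalize();
    return 0;
}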
From: users <[email protected]> On Behalf Of Ralph Castain via
users
Sent: Tuesday, April 14, 2020 8:13 AM
To: Open MPI Users <[email protected]>
Cc: Ralph Castain <[email protected]>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags
Then those flags are correct. I suspect mpirun is executing on n006, yes? The
"location verified" just means that the daemon of rank N reported back from the
node we expected it to be on - Slurm and Cray sometimes renumber the ranks.
Torque doesn't and so you should never see a problem. Since mpirun isn't
launched by itself, its node is never "verified", though I probably should
alter that as it is obviously in the "right" place.
I don't know what you mean by your app not behaving correctly on the remote
nodes - my best guess is that some envar the processes need isn't being forwarded?
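For example, envars that the remote processes rely on can be forwarded explicitly on the mpirun command line with -x (the second variable name below is just a placeholder):

mpirun -x LD_LIBRARY_PATH -x MY_APP_ENVAR <your usual mpirun arguments>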
On Apr 14, 2020, at 2:04 AM, Mccall, Kurt E. (MSFC-EV41) <[email protected]> wrote:
CentOS, Torque.
From: Ralph Castain <[email protected]>
Sent: Monday, April 13, 2020 5:44 PM
To: Mccall, Kurt E. (MSFC-EV41) <[email protected]>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags
What kind of system are you running on? Slurm? Cray? ...?
On Apr 13, 2020, at 3:11 PM, Mccall, Kurt E. (MSFC-EV41) <[email protected]> wrote:
Thanks Ralph. So the difference between the working node flag (0x11) and the
non-working nodes’ flags (0x13) is the flag PRRTE_NODE_FLAG_LOC_VERIFIED.
What does that imply? The location of the daemon has NOT been verified?
Kurt
From: users <[email protected]> On Behalf Of Ralph Castain via users
Sent: Monday, April 13, 2020 4:47 PM
To: Open MPI Users <[email protected]>
Cc: Ralph Castain <[email protected]>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags
I updated the message to explain the flags (instead of a numerical value) for
OMPI v5. In brief:
#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED  0x01   // whether or not the daemon on this node has been launched
#define PRRTE_NODE_FLAG_LOC_VERIFIED     0x02   // whether or not the location has been verified - used for
                                                // environments where the daemon's final destination is uncertain
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED   0x04   // whether or not this node is oversubscribed
#define PRRTE_NODE_FLAG_MAPPED           0x08   // whether we have been added to the current map
#define PRRTE_NODE_FLAG_SLOTS_GIVEN      0x10   // the number of slots was specified - used only in non-managed environments
#define PRRTE_NODE_NON_USABLE            0x20   // the node is hosting a tool and is NOT to be used for jobs
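For illustration, a quick sketch of how those bits combine into the two values you are seeing (0x11 and 0x13):

/* Sketch: decode the flag values from the "ALLOCATED NODES" printout. */
#include <stdio.h>

#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED 0x01
#define PRRTE_NODE_FLAG_LOC_VERIFIED    0x02
#define PRRTE_NODE_FLAG_SLOTS_GIVEN     0x10

static void decode(unsigned flags)
{
    printf("0x%02x =", flags);
    if (flags & PRRTE_NODE_FLAG_DAEMON_LAUNCHED) printf(" DAEMON_LAUNCHED");
    if (flags & PRRTE_NODE_FLAG_LOC_VERIFIED)    printf(" LOC_VERIFIED");
    if (flags & PRRTE_NODE_FLAG_SLOTS_GIVEN)     printf(" SLOTS_GIVEN");
    printf("\n");
}

int main(void)
{
    decode(0x11);   /* n006: DAEMON_LAUNCHED | SLOTS_GIVEN */
    decode(0x13);   /* n001-n005: DAEMON_LAUNCHED | LOC_VERIFIED | SLOTS_GIVEN */
    return 0;
}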
On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users <[email protected]> wrote:
My application is behaving correctly on node n006, and incorrectly on the
lower-numbered nodes. The flags in the error message below may give a clue as to
why. What is the meaning of the flag values 0x11 and 0x13?
====================== ALLOCATED NODES ======================
n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
I’m using Open MPI 4.0.3.
Thanks,
Kurt