Re: [OMPI users] Please help me interpret MPI output

2019-11-21 Thread Peter Kjellström via users
On Wed, 20 Nov 2019 17:38:19 +
"Mccall, Kurt E. (MSFC-EV41) via users"
wrote:

> Hi,
> 
> My job is behaving differently on its two nodes, refusing to
> MPI_Comm_spawn() a process on one of them but succeeding on the
> other.
...
> Data for node: n002    Num slots: 3 ... Bound: N/A
> Data for node: n001    Num slots: 3 ... Bound:
> socket 0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]
...
> Why is the Bound output different between n001 and n002?

Without knowing more details (like exactly which Open MPI version you are
using and exactly how you launched the job), you're not likely to get good answers.

But it does seem clear that the process/rank to hardware (core) pinning
happened on one but not the other node.

This suggests a broken install and/or environment and/or a non-standard
launch.

/Peter K
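
A quick way to confirm whether pinning actually happened on each node is to
have every rank print its own CPU affinity mask. Below is a minimal sketch,
assuming Linux and glibc's sched_getaffinity(); it is not code from this thread:

  /* Each rank reports how many CPUs its affinity mask allows.  A bound
   * rank typically reports 1 (or a few); an unbound rank reports every
   * core on the node. */
  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, len;
      char host[MPI_MAX_PROCESSOR_NAME];
      cpu_set_t mask;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Get_processor_name(host, &len);

      if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
          printf("rank %d on %s: allowed CPUs = %d\n",
                 rank, host, CPU_COUNT(&mask));
      else
          printf("rank %d on %s: sched_getaffinity() failed\n", rank, host);

      MPI_Finalize();
      return 0;
  }

On a node where binding worked, each rank should report a small CPU count
matching its binding width; on an unbound node it will report all cores.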


Re: [OMPI users] Interpreting the output of --display-map and --display-allocation

2019-11-21 Thread Peter Kjellström via users
On Mon, 18 Nov 2019 17:48:30 +
"Mccall, Kurt E. (MSFC-EV41) via users"
wrote:

> I'm trying to debug a problem with my job, launched with the mpiexec
> options -display-map and -display-allocation, but I don't know how to
> interpret the output.   For example,  mpiexec displays the following
> when a job is spawned by MPI_Comm_spawn():
> 
> ==   ALLOCATED NODES   ==
> n002: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
> n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
> 
> Maybe the differing "flags" values have bearing on the problem, but I
> don't know what they mean.   Are the outputs of these two options
> documented anywhere?

I don't know of any such specific documentation, but the flag values are
defined in:

 orte/util/attr.h:54 (openmpi-3.1.4)

The difference between your nodes (bit value 0x2) means:

 #define ORTE_NODE_FLAG_LOC_VERIFIED   0x02   

 // whether or not the location has been verified - used for
 // environments where the daemon's final destination is uncertain

I do not know exactly what that means, but it is not related to whether
pinning is on or off.

Seems to indicate a broken launch and/or install and/or environment.

/Peter K
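
For what it's worth, the flags word can be decoded by hand. The sketch below
uses only the ORTE_NODE_FLAG_LOC_VERIFIED value quoted above (0x02) and the two
flags values from the --display-allocation output (0x13 for n001, 0x11 for
n002); it does not interpret any other bits:

  #include <stdio.h>

  /* From orte/util/attr.h (openmpi-3.1.4), as quoted above. */
  #define ORTE_NODE_FLAG_LOC_VERIFIED 0x02

  int main(void)
  {
      unsigned int n001 = 0x13;           /* flags shown for n001 */
      unsigned int n002 = 0x11;           /* flags shown for n002 */
      unsigned int diff = n001 ^ n002;    /* bits that differ */

      printf("differing bits: 0x%02x\n", diff);
      printf("n001 LOC_VERIFIED: %s\n",
             (n001 & ORTE_NODE_FLAG_LOC_VERIFIED) ? "yes" : "no");
      printf("n002 LOC_VERIFIED: %s\n",
             (n002 & ORTE_NODE_FLAG_LOC_VERIFIED) ? "yes" : "no");
      return 0;
  }

Run against the values above, the only differing bit is 0x02: n001's location
was verified, n002's was not.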


Re: [OMPI users] [EXTERNAL] Re: Please help me interpret MPI output

2019-11-21 Thread Mccall, Kurt E. (MSFC-EV41) via users
Thanks for responding. Here are some more details. I'm using Open MPI 4.0.2,
compiled with the Portland Group compiler, pgc++ 19.5-0, with the build flags

--enable-mpi-cxx  --enable-cxx-exceptions  --with-tm

PBS/Torque version is 5.1.1.

I launched the job with qsub:

qsub -V -j oe -e ./stdio -o ./stdio -f -X -N MyJob -l nodes=2:ppn=3 RunMyJob.bash

My abbreviated mpiexec command within RunMyJob.bash was:

mpiexec --enable-recovery -display-map --display-allocation --mca mpi_param_check 1 --v --x DISPLAY --np 2 --map-by ppr:1:node
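
For context, a minimal parent-side MPI_Comm_spawn() sketch is shown below. It
is hypothetical, not the poster's actual code; the command name "worker" and
the "host" info value "n001" are placeholders:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Comm intercomm;
      MPI_Info info;
      int rc, errcode;

      MPI_Init(&argc, &argv);

      /* Ask the runtime to place the child on a specific node. */
      MPI_Info_create(&info);
      MPI_Info_set(info, "host", "n001");

      rc = MPI_Comm_spawn("worker", MPI_ARGV_NULL, 1, info,
                          0 /* root */, MPI_COMM_SELF,
                          &intercomm, &errcode);
      if (rc != MPI_SUCCESS || errcode != MPI_SUCCESS)
          fprintf(stderr, "spawn failed (rc=%d, errcode=%d)\n", rc, errcode);

      MPI_Info_free(&info);
      MPI_Finalize();
      return 0;
  }

If the spawn succeeds when the "host" key names one node but not the other,
that points at the same per-node difference the map output shows.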


[OMPI users] Slides from the Open MPI SC'19 BOF

2019-11-21 Thread Jeff Squyres (jsquyres) via users
Thanks to all who came to see the Open MPI State of the Union BOF at SC'19 in 
Denver yesterday.

I have posted the slides on the Open MPI web site -- that may take a little 
time to propagate out through the CDN to reach everyone, but they should show 
up soon:

https://www.open-mpi.org/papers/sc-2019/

-- 
Jeff Squyres
jsquy...@cisco.com