Re: [OMPI users] Please help me interpret MPI output
On Wed, 20 Nov 2019 17:38:19 + "Mccall, Kurt E. (MSFC-EV41) via users" wrote:

> Hi,
>
> My job is behaving differently on its two nodes, refusing to
> MPI_Comm_spawn() a process on one of them but succeeding on the
> other.
...
> Data for node: n002  Num slots: 3 ... Bound: N/A
> Data for node: n001  Num slots: 3 ... Bound:
>   socket 0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]
...
> Why is the Bound output different between n001 and n002?

Without knowing more details (what exact Open MPI, how exactly you
tried to launch, etc.) you're not likely to get good answers.

But it does seem clear that the process/rank-to-hardware (core) pinning
happened on one node but not the other. This suggests a broken install
and/or environment and/or non-standard launch.

/Peter K
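[For readers trying to reproduce this: a minimal sketch, not from the
original thread, that makes MPI_Comm_spawn() return an error code
instead of aborting the job, so the failing node can be identified.
The child executable "./worker" is a placeholder.]

  /* spawn_test.c - minimal sketch (not from the thread): spawn one
   * child and report the error instead of aborting. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* Return errors to the caller rather than aborting the job. */
      MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

      MPI_Comm child;
      int errcode;
      /* "./worker" is a placeholder child executable. */
      int rc = MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                              0, MPI_COMM_WORLD, &child, &errcode);
      if (rc != MPI_SUCCESS) {
          char msg[MPI_MAX_ERROR_STRING];
          int len;
          MPI_Error_string(rc, msg, &len);
          fprintf(stderr, "MPI_Comm_spawn failed: %s\n", msg);
      }

      MPI_Finalize();
      return 0;
  }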
Re: [OMPI users] Interpreting the output of --display-map and --display-allocation
On Mon, 18 Nov 2019 17:48:30 + "Mccall, Kurt E. (MSFC-EV41) via users" wrote:

> I'm trying to debug a problem with my job, launched with the mpiexec
> options -display-map and -display-allocation, but I don't know how to
> interpret the output. For example, mpiexec displays the following
> when a job is spawned by MPI_Comm_spawn():
>
> ==   ALLOCATED NODES   ==
> n002: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
> n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
>
> Maybe the differing "flags" values have bearing on the problem, but I
> don't know what they mean. Are the outputs of these two options
> documented anywhere?

I don't know of any such specific documentation, but the flag values
are defined in:

  orte/util/attr.h:54 (openmpi-3.1.4)

The difference between your nodes (bit value 0x2) means:

  #define ORTE_NODE_FLAG_LOC_VERIFIED  0x02
  // whether or not the location has been verified - used for
  // environments where the daemon's final destination is uncertain

I do not know what that means exactly, but it is not related to pinning
being on or off. It seems to indicate a broken launch and/or install
and/or environment.

/Peter K
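[To make the bit arithmetic concrete, a small standalone sketch, not
from the original mail: only the LOC_VERIFIED value comes from attr.h;
the flag words 0x11/0x13 come from the quoted mpiexec output.]

  /* decode_flags.c - show which flag bit differs between the nodes. */
  #include <stdio.h>

  #define ORTE_NODE_FLAG_LOC_VERIFIED 0x02  /* from orte/util/attr.h (3.1.4) */

  int main(void)
  {
      unsigned n002 = 0x11, n001 = 0x13;    /* from the allocation output */
      unsigned diff = n002 ^ n001;          /* 0x02: the only differing bit */

      printf("differing bits: 0x%02x\n", diff);
      printf("n001 LOC_VERIFIED: %s\n",
             (n001 & ORTE_NODE_FLAG_LOC_VERIFIED) ? "yes" : "no");
      printf("n002 LOC_VERIFIED: %s\n",
             (n002 & ORTE_NODE_FLAG_LOC_VERIFIED) ? "yes" : "no");
      return 0;
  }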
Re: [OMPI users] [EXTERNAL] Re: Please help me interpret MPI output
Thanks for responding. Here are some more details. I'm using Open MPI
4.0.2, compiled with the Portland Group compiler pgc++ 19.5-0 and the
build flags --enable-mpi-cxx --enable-cxx-exceptions --with-tm. The
PBS/Torque version is 5.1.1.

I launched the job with qsub:

  qsub -V -j oe -e ./stdio -o ./stdio -f -X -N MyJob -l nodes=2:ppn=3 RunMyJob.bash

My abbreviated mpiexec command within RunMyJob.bash was:

  mpiexec --enable-recovery -display-map --display-allocation --mca mpi_param_check 1 --v --x DISPLAY --np 2 --map-by ppr:1:node
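[A hypothetical companion "worker" for the spawn sketch earlier in this
thread, again not from the original mails: it reports which node each
spawned process actually landed on, which makes the effect of the
--map-by ppr:1:node mapping visible.]

  /* worker.c - placeholder child: report where it landed. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      MPI_Comm parent;
      MPI_Comm_get_parent(&parent);

      char host[MPI_MAX_PROCESSOR_NAME];
      int len, rank;
      MPI_Get_processor_name(host, &len);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("spawned rank %d running on %s (parent comm %s)\n", rank, host,
             parent == MPI_COMM_NULL ? "absent" : "present");

      MPI_Finalize();
      return 0;
  }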
[OMPI users] Slides from the Open MPI SC'19 BOF
Thanks to all who came to see the Open MPI State of the Union BOF at
SC'19 in Denver yesterday.

I have posted the slides on the Open MPI web site -- that may take a
little time to propagate out through the CDN to reach everyone, but
they should show up soon:

  https://www.open-mpi.org/papers/sc-2019/

--
Jeff Squyres
jsquy...@cisco.com