Jeff,

I've tried moving the backing file and it doesn't make a difference. I can
say that PGI 14.7 + Open MPI 1.8.1 does not show this issue: I can run that
combination on 96 cores just fine. Heck, I've run it on a few hundred.
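
For anyone who wants to try the same thing: as far as I know, moving the
backing file amounts to pointing Open MPI's session directory at node-local
disk. A minimal sketch, assuming the orte_tmpdir_base MCA parameter (I haven't
double-checked that exact name against 1.8.1):

  # put the session directory (and thus the shm backing file) on local disk
  setenv TMPDIR /tmp
  mpirun --mca orte_tmpdir_base /tmp -np 96 ./mpi_reproducer.x 4 24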

As for the 96 processes, they run either on 8 Westmere nodes (each with two
6-core sockets) or on 6 Sandy Bridge nodes (each with two 8-core sockets). I
think each set is on a different InfiniBand fabric, but I'm not sure of that.
However, since the PGI 14.7/Open MPI 1.8.1 build works just fine on the exact
same sets of nodes (grabbed via an interactive SLURM job), I can't see how
the InfiniBand fabric would matter.
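
(For reference, grabbing each set interactively looks roughly like the salloc
lines below; the --constraint names here are placeholders, not our site's
actual feature names:)

  # hypothetical: 8 Westmere nodes, 12 cores each
  salloc --nodes=8 --ntasks=96 --constraint=westmere
  # hypothetical: 6 Sandy Bridge nodes, 16 cores each
  salloc --nodes=6 --ntasks=96 --constraint=sandybridge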

I also tried various combinations of:

  mpirun --np
  mpirun --map-by core -np
  mpirun --map-by socket -np

and maybe a few --bind-to variations as well, all with --report-bindings on
to make sure it was doing what I expected, and it was. It wasn't putting 96
processes on a single node, for example, or all on the same socket or core
by some freak accident.
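
For concreteness, a representative invocation (not the exact command, since
the flags varied from run to run) was along the lines of:

  # --report-bindings prints each rank's binding, so misplacement would be obvious
  mpirun --report-bindings --map-by socket --bind-to core -np 96 ./mpi_reproducer.x 4 24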

The only difference between the Open MPI installs is the compiler they were
built with (I'm pretty sure the admins downloaded the source just once).
Looking at "mpif90 -showme" I can see that the PGI 14.7 build includes the
mpi_f90 and mpi Fortran modules while it looks like the GCC 4.9.1 build does
not, but our main code and this reproducer only use mpif.h, so that
shouldn't matter.
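
(If anyone wants to check that on their own install, something like the
following should show it; the lib path is a guess at the usual layout, and
OPENMPI_PREFIX is just a placeholder for wherever the stack is installed:)

  mpif90 -showme:compile
  mpif90 -showme:link
  # if the Fortran 90 bindings were built, mpi.mod usually lands under the prefix
  ls ${OPENMPI_PREFIX}/lib/mpi.mod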

Matt


On Sat, Aug 16, 2014 at 7:33 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Have you tried moving your shared memory backing file directory, like the
> warning message suggests?
>
> I haven't seen a shared memory file on a network share cause correctness
> issues before (just performance issues), but I could see how that could be
> in the realm of possibility...
>
> Also, are you running 96 processes on a single machine, or spread across
> multiple machines?
>
> Note that Open MPI 1.8.x binds each MPI process to a core by default, so
> if you're oversubscribing the machine, it could be fairly disastrous...?
>
>
> On Aug 14, 2014, at 1:29 PM, Matt Thompson <fort...@gmail.com> wrote:
>
> > Open MPI Users,
> >
> > I work on a large climate model called GEOS-5 and we've recently managed
> to get it to compile with gfortran 4.9.1 (our usual compilers are Intel and
> PGI for performance). In doing so, we asked our admins to install Open MPI
> 1.8.1 as the MPI stack instead of MVAPICH2 2.0 mainly because we figure the
> gfortran port is more geared to a desktop.
> >
> > So, the model builds just fine, but when we run it, it stalls in our
> "History" component, whose job is to write out netCDF output files. The
> odd thing is that this stall seems to happen more on our Sandy Bridge
> nodes than on our Westmere nodes, though both eventually hang.
> >
> > A colleague has written a single-file code that emulates our History
> component (the MPI traffic part), which we've used to report bugs to
> MVAPICH. I asked him to try it on this issue, and it seems to reproduce it.
> >
> > To wit, a "successful" run of the code is:
> >
> > (1003) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> > srun.slurm: cluster configuration lacks support for cpu binding
> > srun.slurm: cluster configuration lacks support for cpu binding
> >
> --------------------------------------------------------------------------
> > WARNING: Open MPI will create a shared memory backing file in a
> > directory that appears to be mounted on a network filesystem.
> > Creating the shared memory backup file on a network file system, such
> > as NFS or Lustre is not recommended -- it may cause excessive network
> > traffic to your file servers and/or cause shared memory traffic in
> > Open MPI to be much slower than expected.
> >
> > You may want to check what the typical temporary directory is on your
> > node.  Possible sources of the location of this temporary directory
> > include the $TEMPDIR, $TEMP, and $TMP environment variables.
> >
> > Note, too, that system administrators can set a list of filesystems
> > where Open MPI is disallowed from creating temporary files by setting
> > the MCA parameter "orte_no_session_dir".
> >
> >   Local host: borg01s026
> >   Fileame:
> /gpfsm/dnb31/tdirs/pbs/slurm.2202701.mathomp4/openmpi-sessions-mathomp4@borg01s026_0
> /60464/1/shared_mem_pool.borg01s026
> >
> > You can set the MCA paramter shmem_mmap_enable_nfs_warning to 0 to
> > disable this message.
> >
> --------------------------------------------------------------------------
> >  nx:            4
> >  ny:           24
> >  comm size is           96
> >  local array sizes are          12          12
> >  filling local arrays
> >  creating requests
> >  igather
> >  before collective wait
> >  after collective wait
> >  result is            1   1.00000000       1.00000000
> >  result is            2   1.41421354       1.41421354
> >  result is            3   1.73205078       1.73205078
> >  result is            4   2.00000000       2.00000000
> >  result is            5   2.23606801       2.23606801
> >  result is            6   2.44948983       2.44948983
> >  result is            7   2.64575124       2.64575124
> >  result is            8   2.82842708       2.82842708
> > ...snip...
> >  result is          939   30.6431065       30.6431065
> >  result is          940   30.6594200       30.6594200
> >  result is          941   30.6757240       30.6757240
> >  result is          942   30.6920185       30.6920185
> >  result is          943   30.7083054       30.7083054
> >  result is          944   30.7245827       30.7245827
> >  result is          945   30.7408524       30.7408524
> >
> > Where the second and third columns of numbers are just the square root
> of the first.
> >
> > But, often, the runs do this (note I'm removing the
> shmem_mmap_enable_nfs_warning message for sanity's sake from these copy and
> pastes):
> >
> > (1196) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> > srun.slurm: cluster configuration lacks support for cpu binding
> > srun.slurm: cluster configuration lacks support for cpu binding
> >  nx:            4
> >  ny:           24
> >  comm size is           96
> >  local array sizes are          12          12
> >  filling local arrays
> >  creating requests
> >  igather
> >  before collective wait
> >  after collective wait
> >  result is            1   1.00000000       1.00000000
> >  result is            2   1.41421354       1.41421354
> > [borg01w021:09264] 89 more processes have sent help message
> help-opal-shmem-mmap.txt / mmap on nfs
> > [borg01w021:09264] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages
> >
> > where it prints out a few results.
> >
> > The worst case, and the most common failure mode, is seen most often on
> Sandy Bridge:
> >
> > (1197) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> > srun.slurm: cluster configuration lacks support for cpu binding
> > srun.slurm: cluster configuration lacks support for cpu binding
> >  nx:            4
> >  ny:           24
> >  comm size is           96
> >  local array sizes are          12          12
> >  filling local arrays
> >  creating requests
> >  igather
> >  before collective wait
> > [borg01w021:09367] 89 more processes have sent help message
> help-opal-shmem-mmap.txt / mmap on nfs
> > [borg01w021:09367] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages
> >
> > This hang best matches what we see in our full model code, which halts at
> much the same "place", around a collective wait.
> >
> > Finally, if I setenv OMPI_MCA_orte_base_help_aggregate 0 (to see all
> help/error messages) and additionally turn off the NFS warning, the run
> usually just "hangs" with no error message at all:
> >
> > (1203) $ setenv OMPI_MCA_orte_base_help_aggregate 0
> > (1203) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> > (1204) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> > srun.slurm: cluster configuration lacks support for cpu binding
> > srun.slurm: cluster configuration lacks support for cpu binding
> >  nx:            4
> >  ny:           24
> >  comm size is           96
> >  local array sizes are          12          12
> >  filling local arrays
> >  creating requests
> >  igather
> >  before collective wait
> >
> > Note, this problem doesn't seem to appear at lower numbers of processes
> (16, 24, 32), but it does seem pretty consistent at 96, especially on Sandy
> Bridge nodes.
> >
> > Also, yes, we get that weird srun.slurm warning, but we always seem to
> get it (with both Open MPI and MVAPICH), so while our admins are trying to
> correct it, at present it is not our worry.
> >
> > The MPI stack was compiled with (per our admins):
> >
> > export CFLAGS="-fPIC -m64"
> > export CXXFLAGS="-fPIC -m64"
> > export FFLAGS="-fPIC"
> > export FCFLAGS="-fPIC"
> > export F90FLAGS="-fPIC"
> >
> > export LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64"
> > export CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include"
> >
> > export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/nlocal/slurm/2.6.3/lib64
> >
> > ../configure --with-slurm --disable-wrapper-rpath --enable-shared
> > --enable-mca-no-build=btl-usnic --prefix=${PREFIX}
> >
> > The output of "ompi_info --all" is found:
> >
> >
> https://gist.github.com/mathomp4/301723165efbbb616184#file-ompi_info-out
> >
> > The reproducer code can be found here:
> >
> >
> https://gist.github.com/mathomp4/301723165efbbb616184#file-mpi_reproducer-f90
> >
> > The reproducer is easily built with just 'mpif90' and to run it:
> >
> >   mpirun -np NPROCS ./mpi_reproducer.x NX NY
> >
> > where NX*NY has to equal NPROCS and it's best to keep them even numbers.
> (There might be a few more restrictions and the code will die if you
> violate them.)
> >
> > Thanks,
> > Matt Thompson
> >
> > --
> > Matt Thompson          SSAI, Sr Software Test Engr
> > NASA GSFC, Global Modeling and Assimilation Office
> > Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
> > Phone: 301-614-6712              Fax: 301-614-6246
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25022.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25045.php
>



-- 
"And, isn't sanity really just a one-trick pony anyway? I mean all you
 get is one trick: rational thinking. But when you're good and crazy,
 oooh, oooh, oooh, the sky is the limit!" -- The Tick
