Open MPI Users,

I work on a large climate model called GEOS-5, and we've recently managed to
get it to compile with gfortran 4.9.1 (our usual compilers are Intel and
PGI, for performance). In doing so, we asked our admins to install Open MPI
1.8.1 as the MPI stack instead of MVAPICH2 2.0, mainly because we figure
the gfortran build is more geared to desktop use.

So, the model builds just fine, but when we run it, it stalls in our
"History" component, whose job is to write out netCDF output files. The
odd thing is that the stall seems to happen more often on our Sandy Bridge
nodes than on our Westmere nodes, though both eventually hang.

A colleague has written a single-file code that emulates the MPI traffic
in our History component; we've used it before to report bugs to MVAPICH.
I asked him to try it against this issue, and it duplicates the hang.
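
For orientation, the heart of the reproducer is a set of local arrays
pushed through a nonblocking gather, followed by a collective wait. The
sketch below is a stripped-down paraphrase of that pattern, not the actual
gist code (the names and the hardwired tile size are mine; the real code
is linked at the bottom of this mail):

! Stripped-down sketch of the reproducer's MPI pattern (not the
! actual gist code): fill a local array, post a nonblocking gather,
! then wait on the request. The hang we see is at the wait.
program igather_sketch
  use mpi
  implicit none
  integer, parameter :: n = 12   ! local tile edge in the 4x24, 96-rank runs
  integer :: ierr, rank, nproc, request
  real :: local(n*n)
  real, allocatable :: global(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

  local = sqrt(real(rank + 1))   ! "filling local arrays"
  allocate(global(n*n*nproc))

  ! "creating requests" / "igather": an MPI-3 nonblocking collective
  call MPI_Igather(local, n*n, MPI_REAL, global, n*n, MPI_REAL, &
                   0, MPI_COMM_WORLD, request, ierr)

  ! "before collective wait" -- this is where the runs stall
  call MPI_Wait(request, MPI_STATUS_IGNORE, ierr)
  ! "after collective wait"

  if (rank == 0) print *, 'gathered first value:', global(1)
  call MPI_Finalize(ierr)
end program igather_sketch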

To wit, a "successful" run of the code is:

(1003) $ mpirun -np 96 ./mpi_reproducer.x 4 24
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
--------------------------------------------------------------------------
WARNING: Open MPI will create a shared memory backing file in a
directory that appears to be mounted on a network filesystem.
Creating the shared memory backup file on a network file system, such
as NFS or Lustre is not recommended -- it may cause excessive network
traffic to your file servers and/or cause shared memory traffic in
Open MPI to be much slower than expected.

You may want to check what the typical temporary directory is on your
node.  Possible sources of the location of this temporary directory
include the $TEMPDIR, $TEMP, and $TMP environment variables.

Note, too, that system administrators can set a list of filesystems
where Open MPI is disallowed from creating temporary files by setting
the MCA parameter "orte_no_session_dir".

  Local host: borg01s026
  Fileame: /gpfsm/dnb31/tdirs/pbs/slurm.2202701.mathomp4/openmpi-sessions-mathomp4@borg01s026_0/60464/1/shared_mem_pool.borg01s026

You can set the MCA paramter shmem_mmap_enable_nfs_warning to 0 to
disable this message.
--------------------------------------------------------------------------
 nx:            4
 ny:           24
 comm size is           96
 local array sizes are          12          12
 filling local arrays
 creating requests
 igather
 before collective wait
 after collective wait
 result is            1   1.00000000       1.00000000
 result is            2   1.41421354       1.41421354
 result is            3   1.73205078       1.73205078
 result is            4   2.00000000       2.00000000
 result is            5   2.23606801       2.23606801
 result is            6   2.44948983       2.44948983
 result is            7   2.64575124       2.64575124
 result is            8   2.82842708       2.82842708
...snip...
 result is          939   30.6431065       30.6431065
 result is          940   30.6594200       30.6594200
 result is          941   30.6757240       30.6757240
 result is          942   30.6920185       30.6920185
 result is          943   30.7083054       30.7083054
 result is          944   30.7245827       30.7245827
 result is          945   30.7408524       30.7408524

Here the second and third columns of numbers are just the square root of
the first.

But often a run does this instead (for sanity's sake, I've removed the
shmem_mmap_enable_nfs_warning message from these copy-and-pastes):

(1196) $ mpirun -np 96 ./mpi_reproducer.x 4 24
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
 nx:            4
 ny:           24
 comm size is           96
 local array sizes are          12          12
 filling local arrays
 creating requests
 igather
 before collective wait
 after collective wait
 result is            1   1.00000000       1.00000000
 result is            2   1.41421354       1.41421354
[borg01w021:09264] 89 more processes have sent help message
help-opal-shmem-mmap.txt / mmap on nfs
[borg01w021:09264] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages

where it hangs after printing only a couple of results.

The worst case, seen most often on Sandy Bridge, is also the most frequent
failure mode:

(1197) $ mpirun -np 96 ./mpi_reproducer.x 4 24
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
 nx:            4
 ny:           24
 comm size is           96
 local array sizes are          12          12
 filling local arrays
 creating requests
 igather
 before collective wait
[borg01w021:09367] 89 more processes have sent help message
help-opal-shmem-mmap.txt / mmap on nfs
[borg01w021:09367] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages

This hang compares best to our full model code: it stops at much the same
"place", around a collective wait.

Finally, if I setenv OMPI_MCA_orte_base_help_aggregate 0 (to see all
help/error messages) and additionally turn off the NFS warning, the run
usually just hangs with no error message at all:

(1203) $ setenv OMPI_MCA_orte_base_help_aggregate 0
(1203) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
(1204) $ mpirun -np 96 ./mpi_reproducer.x 4 24
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
 nx:            4
 ny:           24
 comm size is           96
 local array sizes are          12          12
 filling local arrays
 creating requests
 igather
 before collective wait

Note, this problem doesn't seem to appear at lower process counts (16,
24, 32), but it is pretty consistent at 96, especially on the Sandy Bridge
nodes.

Also, yes, we get that odd srun.slurm warning, but we always seem to get
it (with both Open MPI and MVAPICH), so while our admins are trying to
correct it, it is not our present worry.

The MPI stack was compiled with (per our admins):

export CFLAGS="-fPIC -m64"
export CXXFLAGS="-fPIC -m64"
export FFLAGS="-fPIC"
export FCFLAGS="-fPIC"
export F90FLAGS="-fPIC"

export LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64"
export CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include"

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/nlocal/slurm/2.6.3/lib64

../configure --with-slurm --disable-wrapper-rpath --enable-shared --enable-mca-no-build=btl-usnic --prefix=${PREFIX}

The output of "ompi_info --all" can be found here:

  https://gist.github.com/mathomp4/301723165efbbb616184#file-ompi_info-out

The reproducer code can be found here:

  https://gist.github.com/mathomp4/301723165efbbb616184#file-mpi_reproducer-f90

The reproducer is easily built with just 'mpif90' (a sample build line
follows below), and to run it:

  mpirun -np NPROCS ./mpi_reproducer.x NX NY

where NX*NY must equal NPROCS, and it's best to keep them even numbers.
(There may be a few more restrictions; the code will die if you violate
them.)
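
For completeness, here is a sample build line, assuming you save the gist
source as mpi_reproducer.f90 (the file and output names are my own choice;
nothing in the code depends on them):

  mpif90 -o mpi_reproducer.x mpi_reproducer.f90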

Thanks,
Matt Thompson

-- 
Matt Thompson          SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712              Fax: 301-614-6246