Have you tried moving your shared memory backing file directory, like the 
warning message suggests?
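
For example, something like this would point the session directory at node-local storage (assuming each node has a local /tmp; `orte_tmpdir_base` is the relevant MCA parameter in the 1.8 series):

```shell
# Put Open MPI's session directory (which holds the shared memory
# backing file) on node-local storage instead of GPFS/NFS:
mpirun --mca orte_tmpdir_base /tmp -np 96 ./mpi_reproducer.x 4 24

# Or equivalently, via the environment:
export OMPI_MCA_orte_tmpdir_base=/tmp
mpirun -np 96 ./mpi_reproducer.x 4 24
```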

I haven't seen a shared memory file on a network share cause correctness issues 
before (just performance issues), but I suppose it's in the realm of 
possibility...

Also, are you running 96 processes on a single machine, or spread across 
multiple machines?

Note that Open MPI 1.8.x binds each MPI process to a core by default, so if 
you're oversubscribing the machine, it could be fairly disastrous...?
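
A quick way to check is with the 1.8-series mpirun binding options:

```shell
# Show where each rank is bound; with the 1.8 default this should be
# one core per rank:
mpirun -np 96 --report-bindings ./mpi_reproducer.x 4 24

# Disable binding entirely to rule it out as a factor:
mpirun -np 96 --bind-to none ./mpi_reproducer.x 4 24
```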


On Aug 14, 2014, at 1:29 PM, Matt Thompson <fort...@gmail.com> wrote:

> Open MPI Users,
> 
> I work on a large climate model called GEOS-5 and we've recently managed to 
> get it to compile with gfortran 4.9.1 (our usual compilers are Intel and PGI 
> for performance). In doing so, we asked our admins to install Open MPI 1.8.1 
> as the MPI stack instead of MVAPICH2 2.0 mainly because we figure the 
> gfortran port is more geared to a desktop.
> 
> So, the model builds just fine but when we run it, it stalls in our "History" 
> component whose job is to write out netCDF files of output. The odd thing is, 
> though, this stall seems to happen more on our Sandy Bridge nodes than on our 
> Westmere nodes, but both hang.
> 
> A colleague has made a single-file code that emulates our History component 
> (the MPI traffic part), which we've used to report bugs to MVAPICH. I asked 
> him to try it with this issue and it seems to duplicate it.
> 
> To wit, a "successful" run of the code is:
> 
> (1003) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> --------------------------------------------------------------------------
> WARNING: Open MPI will create a shared memory backing file in a
> directory that appears to be mounted on a network filesystem.
> Creating the shared memory backup file on a network file system, such
> as NFS or Lustre is not recommended -- it may cause excessive network
> traffic to your file servers and/or cause shared memory traffic in
> Open MPI to be much slower than expected.
> 
> You may want to check what the typical temporary directory is on your
> node.  Possible sources of the location of this temporary directory
> include the $TEMPDIR, $TEMP, and $TMP environment variables.
> 
> Note, too, that system administrators can set a list of filesystems
> where Open MPI is disallowed from creating temporary files by setting
> the MCA parameter "orte_no_session_dir".
> 
>   Local host: borg01s026
>   Fileame:    
> /gpfsm/dnb31/tdirs/pbs/slurm.2202701.mathomp4/openmpi-sessions-mathomp4@borg01s026_0/60464/1/shared_mem_pool.borg01s026
> 
> You can set the MCA paramter shmem_mmap_enable_nfs_warning to 0 to
> disable this message.
> --------------------------------------------------------------------------
>  nx:            4
>  ny:           24
>  comm size is           96
>  local array sizes are          12          12
>  filling local arrays
>  creating requests
>  igather
>  before collective wait
>  after collective wait
>  result is            1   1.00000000       1.00000000    
>  result is            2   1.41421354       1.41421354    
>  result is            3   1.73205078       1.73205078    
>  result is            4   2.00000000       2.00000000    
>  result is            5   2.23606801       2.23606801    
>  result is            6   2.44948983       2.44948983    
>  result is            7   2.64575124       2.64575124    
>  result is            8   2.82842708       2.82842708    
> ...snip...
>  result is          939   30.6431065       30.6431065    
>  result is          940   30.6594200       30.6594200    
>  result is          941   30.6757240       30.6757240    
>  result is          942   30.6920185       30.6920185    
>  result is          943   30.7083054       30.7083054    
>  result is          944   30.7245827       30.7245827    
>  result is          945   30.7408524       30.7408524    
> 
> Where the second and third columns of numbers are just the square root of the 
> first.
> 
> But, often, the runs do this (note I'm removing the 
> shmem_mmap_enable_nfs_warning message from these copy-and-pastes for 
> sanity's sake):
> 
> (1196) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
>  nx:            4
>  ny:           24
>  comm size is           96
>  local array sizes are          12          12
>  filling local arrays
>  creating requests
>  igather
>  before collective wait
>  after collective wait
>  result is            1   1.00000000       1.00000000    
>  result is            2   1.41421354       1.41421354    
> [borg01w021:09264] 89 more processes have sent help message 
> help-opal-shmem-mmap.txt / mmap on nfs
> [borg01w021:09264] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
> all help / error messages
> 
> where it prints out a few results.
> 
> The worst case, seen most often on Sandy Bridge, is the most frequent 
> failure:
> 
> (1197) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
>  nx:            4
>  ny:           24
>  comm size is           96
>  local array sizes are          12          12
>  filling local arrays
>  creating requests
>  igather
>  before collective wait
> [borg01w021:09367] 89 more processes have sent help message 
> help-opal-shmem-mmap.txt / mmap on nfs
> [borg01w021:09367] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
> all help / error messages
> 
> This hang best matches what we see in our full model code: it halts at much 
> the same "place", around a collective wait.
> 
> Finally, if I setenv OMPI_MCA_orte_base_help_aggregate 0 (to see all 
> help/error messages), the run usually just "hangs" with no error message at 
> all (here I've additionally turned off the warning):
> 
> (1203) $ setenv OMPI_MCA_orte_base_help_aggregate 0
> (1203) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> (1204) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
>  nx:            4
>  ny:           24
>  comm size is           96
>  local array sizes are          12          12
>  filling local arrays
>  creating requests
>  igather
>  before collective wait
> 
> Note, this problem doesn't seem to appear at lower process counts (16, 24, 
> 32) but is pretty consistent at 96, especially on Sandy Bridge.
> 
> Also, yes, we get that weird srun.slurm warning, but we always have (with 
> both Open MPI and MVAPICH); our admins are trying to correct it, but at 
> present it is not our worry.
> 
> The MPI stack was compiled with (per our admins):
> 
> export CFLAGS="-fPIC -m64"
> export CXXFLAGS="-fPIC -m64"
> export FFLAGS="-fPIC"
> export FCFLAGS="-fPIC"
> export F90FLAGS="-fPIC"
> 
> export LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64"
> export CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include"
> 
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/nlocal/slurm/2.6.3/lib64
> 
> ../configure --with-slurm --disable-wrapper-rpath --enable-shared
> --enable-mca-no-build=btl-usnic --prefix=${PREFIX}
> 
> The output of "ompi_info --all" is found:
> 
>   https://gist.github.com/mathomp4/301723165efbbb616184#file-ompi_info-out
> 
> The reproducer code can be found here:
> 
>   
> https://gist.github.com/mathomp4/301723165efbbb616184#file-mpi_reproducer-f90
> 
> The reproducer is easily built with just 'mpif90' and to run it:
> 
>   mpirun -np NPROCS ./mpi_reproducer.x NX NY
> 
> where NX*NY has to equal NPROCS and it's best to keep them even numbers. 
> (There might be a few more restrictions and the code will die if you violate 
> them.)
> 
> Thanks,
> Matt Thompson
> 
> -- 
> Matt Thompson          SSAI, Sr Software Test Engr
> NASA GSFC, Global Modeling and Assimilation Office
> Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
> Phone: 301-614-6712              Fax: 301-614-6246 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25022.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
