Have you tried moving your shared memory backing file directory, like the warning message suggests?
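For example -- and this is just a sketch, assuming /tmp is node-local storage on your cluster -- you can point the session directory (where that backing file lives) somewhere else via the orte_tmpdir_base MCA parameter, or via the temporary-directory environment variables the warning message mentions:

  $ mpirun --mca orte_tmpdir_base /tmp -np 96 ./mpi_reproducer.x 4 24

As a separate diagnostic (not a fix), you could also try a run with the sm BTL excluded to see whether the shared memory transport is implicated at all:

  $ mpirun --mca btl ^sm -np 96 ./mpi_reproducer.x 4 24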
I haven't seen a shared memory file on a network share cause correctness issues before (just performance issues), but I could see how that could be in the realm of possibility...

Also, are you running 96 processes on a single machine, or spread across multiple machines? Note that Open MPI 1.8.x binds each MPI process to a core by default, so if you're oversubscribing the machine, it could be fairly disastrous. (I've sketched a quick way to check this below, right after your note about the srun.slurm binding warning.)

On Aug 14, 2014, at 1:29 PM, Matt Thompson <fort...@gmail.com> wrote:

> Open MPI Users,
> 
> I work on a large climate model called GEOS-5 and we've recently managed to get it to compile with gfortran 4.9.1 (our usual compilers are Intel and PGI for performance). In doing so, we asked our admins to install Open MPI 1.8.1 as the MPI stack instead of MVAPICH2 2.0 mainly because we figure the gfortran port is more geared to a desktop.
> 
> So, the model builds just fine but when we run it, it stalls in our "History" component whose job is to write out netCDF files of output. The odd thing is, though, this stall seems to happen more on our Sandy Bridge nodes than on our Westmere nodes, but both hang.
> 
> A colleague has made a single-file code that emulates our History component (the MPI traffic part) that we've used to report bugs to MVAPICH, and I asked him to try it with this issue and it seems to duplicate it.
> 
> To wit, a "successful" run of the code is:
> 
> (1003) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> --------------------------------------------------------------------------
> WARNING: Open MPI will create a shared memory backing file in a
> directory that appears to be mounted on a network filesystem.
> Creating the shared memory backup file on a network file system, such
> as NFS or Lustre is not recommended -- it may cause excessive network
> traffic to your file servers and/or cause shared memory traffic in
> Open MPI to be much slower than expected.
> 
> You may want to check what the typical temporary directory is on your
> node. Possible sources of the location of this temporary directory
> include the $TEMPDIR, $TEMP, and $TMP environment variables.
> 
> Note, too, that system administrators can set a list of filesystems
> where Open MPI is disallowed from creating temporary files by setting
> the MCA parameter "orte_no_session_dir".
> 
> Local host: borg01s026
> Fileame:
> /gpfsm/dnb31/tdirs/pbs/slurm.2202701.mathomp4/openmpi-sessions-mathomp4@borg01s026_0/60464/1/shared_mem_pool.borg01s026
> 
> You can set the MCA paramter shmem_mmap_enable_nfs_warning to 0 to
> disable this message.
> --------------------------------------------------------------------------
> nx: 4
> ny: 24
> comm size is 96
> local array sizes are 12 12
> filling local arrays
> creating requests
> igather
> before collective wait
> after collective wait
> result is 1 1.00000000 1.00000000
> result is 2 1.41421354 1.41421354
> result is 3 1.73205078 1.73205078
> result is 4 2.00000000 2.00000000
> result is 5 2.23606801 2.23606801
> result is 6 2.44948983 2.44948983
> result is 7 2.64575124 2.64575124
> result is 8 2.82842708 2.82842708
> ...snip...
> result is 939 30.6431065 30.6431065
> result is 940 30.6594200 30.6594200
> result is 941 30.6757240 30.6757240
> result is 942 30.6920185 30.6920185
> result is 943 30.7083054 30.7083054
> result is 944 30.7245827 30.7245827
> result is 945 30.7408524 30.7408524
> 
> Where the second and third columns of numbers are just the square root of the first.
> 
> But, often, the runs do this (note I'm removing the shmem_mmap_enable_nfs_warning message for sanity's sake from these copy and pastes):
> 
> (1196) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> nx: 4
> ny: 24
> comm size is 96
> local array sizes are 12 12
> filling local arrays
> creating requests
> igather
> before collective wait
> after collective wait
> result is 1 1.00000000 1.00000000
> result is 2 1.41421354 1.41421354
> [borg01w021:09264] 89 more processes have sent help message help-opal-shmem-mmap.txt / mmap on nfs
> [borg01w021:09264] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> 
> where it prints out a few results.
> 
> The worst case is most often seen on Sandy Bridge and is the most often failure:
> 
> (1197) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> nx: 4
> ny: 24
> comm size is 96
> local array sizes are 12 12
> filling local arrays
> creating requests
> igather
> before collective wait
> [borg01w021:09367] 89 more processes have sent help message help-opal-shmem-mmap.txt / mmap on nfs
> [borg01w021:09367] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> 
> This halt best compares to our full model code. It halts much at the same "place" around a collective wait.
> 
> Finally, if I setenv OMPI_MCA_orte_base_help_aggregate 0 (to see all help/error messages) I usually just "hang" with no error message at all (additionally turning off the warning):
> 
> (1203) $ setenv OMPI_MCA_orte_base_help_aggregate 0
> (1203) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> (1204) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> srun.slurm: cluster configuration lacks support for cpu binding
> srun.slurm: cluster configuration lacks support for cpu binding
> nx: 4
> ny: 24
> comm size is 96
> local array sizes are 12 12
> filling local arrays
> creating requests
> igather
> before collective wait
> 
> Note, this problem doesn't seem to appear at lower number of processes (16, 24, 32) but does seem pretty consistent at 96, especially on Sandy Bridges.
> 
> Also, yes, we get that weird srun.slurm warning but we always seem to get that (Open MPI, MVAPICH) so while our admins are trying to correct that, at present it is not our worry.
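(Interjecting here on the binding point, since it ties back to my question above: a quick check -- just a suggestion, not a diagnosis -- is to re-run the reproducer once with the bindings reported and once with binding disabled, to rule oversubscription in or out. Re-using the same arguments from your runs above:

  $ mpirun --report-bindings -np 96 ./mpi_reproducer.x 4 24
  $ mpirun --bind-to none -np 96 ./mpi_reproducer.x 4 24

If --report-bindings shows multiple ranks bound to the same core, oversubscription is a more likely culprit than the shared memory backing file.)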
> 
> The MPI stack was compiled with (per our admins):
> 
> export CFLAGS="-fPIC -m64"
> export CXXFLAGS="-fPIC -m64"
> export FFLAGS="-fPIC"
> export FCFLAGS="-fPIC"
> export F90FLAGS="-fPIC"
> 
> export LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64"
> export CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include"
> 
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/nlocal/slurm/2.6.3/lib64
> 
> ../configure --with-slurm --disable-wrapper-rpath --enable-shared --enable-mca-no-build=btl-usnic --prefix=${PREFIX}
> 
> The output of "ompi_info --all" is found:
> 
> https://gist.github.com/mathomp4/301723165efbbb616184#file-ompi_info-out
> 
> The reproducer code can be found here:
> 
> https://gist.github.com/mathomp4/301723165efbbb616184#file-mpi_reproducer-f90
> 
> The reproducer is easily built with just 'mpif90' and to run it:
> 
> mpirun -np NPROCS ./mpi_reproducer.x NX NY
> 
> where NX*NY has to equal NPROCS and it's best to keep them even numbers. (There might be a few more restrictions and the code will die if you violate them.)
> 
> Thanks,
> Matt Thompson
> 
> -- 
> Matt Thompson    SSAI, Sr Software Test Engr
> NASA GSFC, Global Modeling and Assimilation Office
> Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
> Phone: 301-614-6712    Fax: 301-614-6246
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25022.php

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/