Open MPI Users,

I work on a large climate model called GEOS-5, and we've recently managed to get it to compile with gfortran 4.9.1 (our usual compilers are Intel and PGI, for performance). In doing so, we asked our admins to install Open MPI 1.8.1 as the MPI stack instead of MVAPICH2 2.0, mainly because we figure the gfortran port is more geared toward desktop use.
So, the model builds just fine, but when we run it, it stalls in our "History" component, whose job is to write out netCDF files of output. The odd thing is that this stall seems to happen more on our Sandy Bridge nodes than on our Westmere nodes, though both hang. A colleague has written a single-file code that emulates the MPI traffic of our History component (we've used it to report bugs to the MVAPICH team), and when I asked him to try it on this issue, it seems to duplicate the hang. To wit, a "successful" run of the code looks like this:

(1003) $ mpirun -np 96 ./mpi_reproducer.x 4 24
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
--------------------------------------------------------------------------
WARNING: Open MPI will create a shared memory backing file in a
directory that appears to be mounted on a network filesystem.
Creating the shared memory backup file on a network file system, such
as NFS or Lustre is not recommended -- it may cause excessive network
traffic to your file servers and/or cause shared memory traffic in
Open MPI to be much slower than expected.

You may want to check what the typical temporary directory is on your
node. Possible sources of the location of this temporary directory
include the $TEMPDIR, $TEMP, and $TMP environment variables.

Note, too, that system administrators can set a list of filesystems
where Open MPI is disallowed from creating temporary files by setting
the MCA parameter "orte_no_session_dir".

  Local host: borg01s026
  Filename:   /gpfsm/dnb31/tdirs/pbs/slurm.2202701.mathomp4/openmpi-sessions-mathomp4@borg01s026_0/60464/1/shared_mem_pool.borg01s026

You can set the MCA parameter shmem_mmap_enable_nfs_warning to 0 to
disable this message.
--------------------------------------------------------------------------
nx: 4
ny: 24
comm size is 96
local array sizes are 12 12
filling local arrays
creating requests
igather
before collective wait
after collective wait
result is 1 1.00000000 1.00000000
result is 2 1.41421354 1.41421354
result is 3 1.73205078 1.73205078
result is 4 2.00000000 2.00000000
result is 5 2.23606801 2.23606801
result is 6 2.44948983 2.44948983
result is 7 2.64575124 2.64575124
result is 8 2.82842708 2.82842708
...snip...
result is 939 30.6431065 30.6431065
result is 940 30.6594200 30.6594200
result is 941 30.6757240 30.6757240
result is 942 30.6920185 30.6920185
result is 943 30.7083054 30.7083054
result is 944 30.7245827 30.7245827
result is 945 30.7408524 30.7408524

Here the second and third columns of numbers are just the square root of the first. But often the runs do this instead (note I'm removing the shmem_mmap_enable_nfs_warning message from these copy-and-pastes for sanity's sake):

(1196) $ mpirun -np 96 ./mpi_reproducer.x 4 24
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
nx: 4
ny: 24
comm size is 96
local array sizes are 12 12
filling local arrays
creating requests
igather
before collective wait
after collective wait
result is 1 1.00000000 1.00000000
result is 2 1.41421354 1.41421354
[borg01w021:09264] 89 more processes have sent help message help-opal-shmem-mmap.txt / mmap on nfs
[borg01w021:09264] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

That is, it prints out only a few of the results.
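For what it's worth, the heart of the reproducer (as the output above suggests) is just a non-blocking gather followed by a wait on the request. Below is a stripped-down sketch of that pattern, my own simplification rather than the actual code in the gist linked further down; the array sizes and fill values here are made up:

program igather_sketch
   use mpi
   implicit none

   ! made-up per-rank size; the real reproducer uses 2-D local arrays
   integer, parameter :: n = 4
   integer :: ierr, rank, nproc, req, i
   real    :: sendbuf(n)
   real, allocatable :: recvbuf(:)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

   ! each rank fills its chunk so the gathered array is 1, 2, 3, ...
   do i = 1, n
      sendbuf(i) = real(rank*n + i)
   end do

   allocate(recvbuf(n*nproc))

   ! non-blocking gather onto rank 0, then a wait on the request
   call MPI_Igather(sendbuf, n, MPI_REAL, recvbuf, n, MPI_REAL, &
                    0, MPI_COMM_WORLD, req, ierr)
   if (rank == 0) print *, 'before collective wait'
   call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)
   if (rank == 0) print *, 'after collective wait'

   ! rank 0 checks the gathered values against sqrt(i)
   if (rank == 0) then
      do i = 1, n*nproc
         print *, 'result is', i, sqrt(real(i)), sqrt(recvbuf(i))
      end do
   end if

   deallocate(recvbuf)
   call MPI_Finalize(ierr)
end program igather_sketch

The real reproducer does the equivalent with 2-D local arrays and several requests, but the stall is always around the wait after the igather.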
The worst case, which is the most frequent failure and is seen most often on Sandy Bridge, is this:

(1197) $ mpirun -np 96 ./mpi_reproducer.x 4 24
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
nx: 4
ny: 24
comm size is 96
local array sizes are 12 12
filling local arrays
creating requests
igather
before collective wait
[borg01w021:09367] 89 more processes have sent help message help-opal-shmem-mmap.txt / mmap on nfs
[borg01w021:09367] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

This hang most closely resembles what we see with our full model code: it stalls at much the same "place", around a collective wait. Finally, if I setenv OMPI_MCA_orte_base_help_aggregate to 0 (to see all help/error messages) and additionally turn off the NFS warning, I usually just get a hang with no error message at all:

(1203) $ setenv OMPI_MCA_orte_base_help_aggregate 0
(1203) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
(1204) $ mpirun -np 96 ./mpi_reproducer.x 4 24
srun.slurm: cluster configuration lacks support for cpu binding
srun.slurm: cluster configuration lacks support for cpu binding
nx: 4
ny: 24
comm size is 96
local array sizes are 12 12
filling local arrays
creating requests
igather
before collective wait

Note that this problem doesn't seem to appear at lower process counts (16, 24, 32), but it is pretty consistent at 96, especially on the Sandy Bridge nodes. Also, yes, we do get that odd srun.slurm warning, but we always seem to get it (with both Open MPI and MVAPICH), so while our admins are trying to correct it, it is not our worry at present.

The MPI stack was compiled with (per our admins):

export CFLAGS="-fPIC -m64"
export CXXFLAGS="-fPIC -m64"
export FFLAGS="-fPIC"
export FCFLAGS="-fPIC"
export F90FLAGS="-fPIC"
export LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64"
export CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/nlocal/slurm/2.6.3/lib64

../configure --with-slurm --disable-wrapper-rpath --enable-shared --enable-mca-no-build=btl-usnic --prefix=${PREFIX}

The output of "ompi_info --all" can be found here:

  https://gist.github.com/mathomp4/301723165efbbb616184#file-ompi_info-out

The reproducer code can be found here:

  https://gist.github.com/mathomp4/301723165efbbb616184#file-mpi_reproducer-f90

The reproducer builds easily with just 'mpif90', and you run it with:

  mpirun -np NPROCS ./mpi_reproducer.x NX NY

where NX*NY has to equal NPROCS, and it's best to keep them even numbers. (There might be a few more restrictions, and the code will die if you violate them.)

Thanks,
Matt Thompson

--
Matt Thompson          SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712              Fax: 301-614-6246