Jeff,

I've tried moving the backing file and it doesn't matter. I can say that PGI 14.7 + Open MPI 1.8.1 does not show this issue. I can run that on 96 cores just fine. Heck, I've run it on a few hundred.
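For reference, "moving the backing file" here meant pointing the session/temporary directory at local disk before launching; what I tried was roughly along these lines (treat the exact knobs as approximate -- the warning itself points at the $TEMPDIR/$TEMP/$TMP variables, and I believe there is also an orte_tmpdir_base MCA parameter that relocates the session directory):

  setenv TMPDIR /tmp
  mpirun -np 96 ./mpi_reproducer.x 4 24

or something like:

  mpirun --mca orte_tmpdir_base /tmp -np 96 ./mpi_reproducer.x 4 24

Either way, it made no difference to the hang.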
As for the 96 processes, they are either on 8 Westmere nodes (8 nodes with 2 6-core sockets) or 6 Sandy Bridge nodes (6 nodes with 2 8-core sockets). I think each set is on a different Infiniband fabric, but I'm not sure of that. However, since the PGI 14.7/Open MPI 1.8.1 stack works just fine on the exact same sets of nodes (grabbed via an interactive SLURM job), I can't see how the Infiniband fabric would matter.

I also tried various combinations of:

  mpirun -np
  mpirun --map-by core -np
  mpirun --map-by socket -np

and maybe a few -bind-to as well, all with --report-bindings on to make sure it was doing what I expected, and it was. It wasn't putting 96 processes on a single node, for example, or all on the same socket or core by some freak accident.

The only difference between the Open MPI installs is the compilers they were built with (I'm pretty sure the admins just downloaded the source once). Looking at "mpif90 -showme" I can see that the PGI 14.7 build includes the mpi_f90 and mpi modules while it looks like the GCC 4.9.1 build does not, but our main code and this reproducer only use mpif.h, so that shouldn't matter.

Matt

On Sat, Aug 16, 2014 at 7:33 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Have you tried moving your shared memory backing file directory, like the
> warning message suggests?
>
> I haven't seen a shared memory file on a network share cause correctness
> issues before (just performance issues), but I could see how that could be
> in the realm of possibility...
>
> Also, are you running 96 processes on a single machine, or spread across
> multiple machines?
>
> Note that Open MPI 1.8.x binds each MPI process to a core by default, so
> if you're oversubscribing the machine, it could be fairly disastrous...?
>
>
> On Aug 14, 2014, at 1:29 PM, Matt Thompson <fort...@gmail.com> wrote:
>
> > Open MPI Users,
> >
> > I work on a large climate model called GEOS-5 and we've recently managed
> > to get it to compile with gfortran 4.9.1 (our usual compilers are Intel
> > and PGI for performance). In doing so, we asked our admins to install
> > Open MPI 1.8.1 as the MPI stack instead of MVAPICH2 2.0, mainly because
> > we figure the gfortran port is more geared to a desktop.
> >
> > So, the model builds just fine, but when we run it, it stalls in our
> > "History" component, whose job is to write out netCDF files of output.
> > The odd thing is, though, this stall seems to happen more on our Sandy
> > Bridge nodes than on our Westmere nodes, but both hang.
> >
> > A colleague has made a single-file code that emulates our History
> > component (the MPI traffic part) that we've used to report bugs to
> > MVAPICH, and I asked him to try it with this issue and it seems to
> > duplicate it.
> >
> > To wit, a "successful" run of the code is:
> >
> > (1003) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> > srun.slurm: cluster configuration lacks support for cpu binding
> > srun.slurm: cluster configuration lacks support for cpu binding
> > --------------------------------------------------------------------------
> > WARNING: Open MPI will create a shared memory backing file in a
> > directory that appears to be mounted on a network filesystem.
> > Creating the shared memory backup file on a network file system, such
> > as NFS or Lustre is not recommended -- it may cause excessive network
> > traffic to your file servers and/or cause shared memory traffic in
> > Open MPI to be much slower than expected.
> >
> > You may want to check what the typical temporary directory is on your
> > node.
> > Possible sources of the location of this temporary directory include
> > the $TEMPDIR, $TEMP, and $TMP environment variables.
> >
> > Note, too, that system administrators can set a list of filesystems
> > where Open MPI is disallowed from creating temporary files by setting
> > the MCA parameter "orte_no_session_dir".
> >
> >   Local host: borg01s026
> >   Fileame: /gpfsm/dnb31/tdirs/pbs/slurm.2202701.mathomp4/openmpi-sessions-mathomp4@borg01s026_0/60464/1/shared_mem_pool.borg01s026
> >
> > You can set the MCA paramter shmem_mmap_enable_nfs_warning to 0 to
> > disable this message.
> > --------------------------------------------------------------------------
> > nx: 4
> > ny: 24
> > comm size is 96
> > local array sizes are 12 12
> > filling local arrays
> > creating requests
> > igather
> > before collective wait
> > after collective wait
> > result is 1 1.00000000 1.00000000
> > result is 2 1.41421354 1.41421354
> > result is 3 1.73205078 1.73205078
> > result is 4 2.00000000 2.00000000
> > result is 5 2.23606801 2.23606801
> > result is 6 2.44948983 2.44948983
> > result is 7 2.64575124 2.64575124
> > result is 8 2.82842708 2.82842708
> > ...snip...
> > result is 939 30.6431065 30.6431065
> > result is 940 30.6594200 30.6594200
> > result is 941 30.6757240 30.6757240
> > result is 942 30.6920185 30.6920185
> > result is 943 30.7083054 30.7083054
> > result is 944 30.7245827 30.7245827
> > result is 945 30.7408524 30.7408524
> >
> > Where the second and third columns of numbers are just the square root
> > of the first.
> >
> > But, often, the runs do this (note I'm removing the
> > shmem_mmap_enable_nfs_warning message for sanity's sake from these
> > copy-and-pastes):
> >
> > (1196) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> > srun.slurm: cluster configuration lacks support for cpu binding
> > srun.slurm: cluster configuration lacks support for cpu binding
> > nx: 4
> > ny: 24
> > comm size is 96
> > local array sizes are 12 12
> > filling local arrays
> > creating requests
> > igather
> > before collective wait
> > after collective wait
> > result is 1 1.00000000 1.00000000
> > result is 2 1.41421354 1.41421354
> > [borg01w021:09264] 89 more processes have sent help message
> > help-opal-shmem-mmap.txt / mmap on nfs
> > [borg01w021:09264] Set MCA parameter "orte_base_help_aggregate" to 0 to
> > see all help / error messages
> >
> > where it prints out only a few results.
> >
> > The worst case is most often seen on Sandy Bridge and is the most
> > common failure:
> >
> > (1197) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> > srun.slurm: cluster configuration lacks support for cpu binding
> > srun.slurm: cluster configuration lacks support for cpu binding
> > nx: 4
> > ny: 24
> > comm size is 96
> > local array sizes are 12 12
> > filling local arrays
> > creating requests
> > igather
> > before collective wait
> > [borg01w021:09367] 89 more processes have sent help message
> > help-opal-shmem-mmap.txt / mmap on nfs
> > [borg01w021:09367] Set MCA parameter "orte_base_help_aggregate" to 0 to
> > see all help / error messages
> >
> > This halt best matches what we see in our full model code: it halts at
> > much the same "place", around a collective wait.
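> >
> > In case it helps without pulling down the gist: the heart of the
> > reproducer is essentially a non-blocking gather followed by a wait on
> > the resulting request. The snippet below is only a trimmed sketch of
> > that pattern, not the actual mpi_reproducer.f90 -- the program name,
> > the single-element-per-rank gather, and the root rank are
> > simplifications for illustration:
> >
> > program igather_sketch
> >   ! Sketch only: non-blocking gather of one real per rank to rank 0,
> >   ! then the wait that the hanging runs never get past.
> >   implicit none
> >   include 'mpif.h'
> >   integer :: ierr, rank, nproc, req
> >   real :: localval
> >   real, allocatable :: gathered(:)
> >
> >   call MPI_Init(ierr)
> >   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
> >   call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
> >
> >   allocate(gathered(nproc))
> >   localval = sqrt(real(rank + 1))   ! mirrors the sqrt values printed above
> >
> >   if (rank == 0) print *, 'igather'
> >   call MPI_Igather(localval, 1, MPI_REAL, gathered, 1, MPI_REAL, &
> >                    0, MPI_COMM_WORLD, req, ierr)
> >   if (rank == 0) print *, 'before collective wait'
> >   call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)
> >   if (rank == 0) print *, 'after collective wait'
> >
> >   deallocate(gathered)
> >   call MPI_Finalize(ierr)
> > end program igather_sketch
> >
> > The real code does this with 2-D local arrays (the 12x12 tiles in the
> > output above) and a set of requests, but the stall shows up at that
> > wait either way.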
> >
> > Finally, if I setenv OMPI_MCA_orte_base_help_aggregate 0 (to see all
> > help/error messages) and additionally turn off the NFS warning, the run
> > usually just "hangs" with no error message at all:
> >
> > (1203) $ setenv OMPI_MCA_orte_base_help_aggregate 0
> > (1203) $ setenv OMPI_MCA_shmem_mmap_enable_nfs_warning 0
> > (1204) $ mpirun -np 96 ./mpi_reproducer.x 4 24
> > srun.slurm: cluster configuration lacks support for cpu binding
> > srun.slurm: cluster configuration lacks support for cpu binding
> > nx: 4
> > ny: 24
> > comm size is 96
> > local array sizes are 12 12
> > filling local arrays
> > creating requests
> > igather
> > before collective wait
> >
> > Note, this problem doesn't seem to appear at lower numbers of processes
> > (16, 24, 32) but does seem pretty consistent at 96, especially on the
> > Sandy Bridge nodes.
> >
> > Also, yes, we get that weird srun.slurm warning, but we always seem to
> > get it (with both Open MPI and MVAPICH), so while our admins are trying
> > to correct it, at present it is not our worry.
> >
> > The MPI stack was compiled with (per our admins):
> >
> > export CFLAGS="-fPIC -m64"
> > export CXXFLAGS="-fPIC -m64"
> > export FFLAGS="-fPIC"
> > export FCFLAGS="-fPIC"
> > export F90FLAGS="-fPIC"
> >
> > export LDFLAGS="-L/usr/nlocal/slurm/2.6.3/lib64"
> > export CPPFLAGS="-I/usr/nlocal/slurm/2.6.3/include"
> >
> > export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/nlocal/slurm/2.6.3/lib64
> >
> > ../configure --with-slurm --disable-wrapper-rpath --enable-shared
> >   --enable-mca-no-build=btl-usnic --prefix=${PREFIX}
> >
> > The output of "ompi_info --all" can be found here:
> >
> > https://gist.github.com/mathomp4/301723165efbbb616184#file-ompi_info-out
> >
> > The reproducer code can be found here:
> >
> > https://gist.github.com/mathomp4/301723165efbbb616184#file-mpi_reproducer-f90
> >
> > The reproducer is easily built with just 'mpif90', and to run it:
> >
> > mpirun -np NPROCS ./mpi_reproducer.x NX NY
> >
> > where NX*NY has to equal NPROCS, and it's best to keep them even
> > numbers. (There might be a few more restrictions, and the code will die
> > if you violate them.)
> >
> > Thanks,
> > Matt Thompson
> >
> > --
> > Matt Thompson          SSAI, Sr Software Test Engr
> > NASA GSFC, Global Modeling and Assimilation Office
> > Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
> > Phone: 301-614-6712    Fax: 301-614-6246
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25022.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25045.php

--
"And, isn't sanity really just a one-trick pony anyway? I mean all you get
is one trick: rational thinking. But when you're good and crazy, oooh,
oooh, oooh, the sky is the limit!" -- The Tick