Folks,

the problem is indeed pretty trivial to reproduce.
I opened https://github.com/open-mpi/ompi/issues/2550 (and included a reproducer).
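For illustration, a minimal sketch of the kind of mismatch discussed below (one task ending up in MPI_Finalize, which is what VASP's STOP replacement effectively does, while its peers are still blocked in MPI_Allreduce) could look like the following. This is a sketch only, not necessarily the reproducer attached to the issue:

program allreduce_finalize_mismatch
  ! Rank 0 "stops" by calling MPI_Finalize while the other ranks enter a
  ! collective that includes rank 0, so they wait forever and mpirun never
  ! completes.
  use mpi
  implicit none
  integer :: rank, ierr
  integer :: sendval, recvval

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  if (rank == 0) then
     ! Roughly what a finalize-based STOP wrapper boils down to.
     call MPI_Finalize(ierr)
     ! Calling MPI_Abort(MPI_COMM_WORLD, 1, ierr) here instead would ask the
     ! runtime to terminate every task in the job, which is what a STOP
     ! really wants.
  else
     sendval = rank
     call MPI_Allreduce(sendval, recvval, 1, MPI_INTEGER, MPI_SUM, &
                        MPI_COMM_WORLD, ierr)
     call MPI_Finalize(ierr)
  end if
end program allreduce_finalize_mismatch

Run it with two or more tasks (e.g. mpirun -np 4 ./a.out) and the job should hang as described. A sketch of the symbol.inc macro Noam describes, and an MPI_Abort-based alternative, follows below the quoted message.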
Cheers,
Gilles

On Fri, Dec 9, 2016 at 5:15 AM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
> On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
> Christof,
>
> There is something really odd with this stack trace:
> count is zero, and some pointers do not point to valid addresses (!).
>
> In Open MPI, MPI_Allreduce(..., count=0, ...) is a no-op, so that suggests
> that the stack has been corrupted inside MPI_Allreduce(), or that you are
> not using the library you think you are using.
> pmap <pid> will show you which lib is used.
>
> By the way, this was not started with
>     mpirun --mca coll ^tuned ...
> right?
>
> Just to make it clear: a task from your program bluntly issues a Fortran
> STOP, and this is kind of a feature. The *only* issue is that mpirun does
> not kill the other MPI tasks and mpirun never completes.
> Did I get that right?
>
>
> I just ran across very similar behavior in VASP (which we just switched
> over to Open MPI 2.0.1), also in an allreduce + STOP combination (some
> nodes call one, others call the other), and I discovered several
> interesting things.
>
> The most important is that when MPI is active, the preprocessor converts
> (via a #define in symbol.inc) a Fortran STOP into a call to m_exit()
> (defined in mpi.F), which is a wrapper around mpi_finalize. So in my case
> some processes in the communicator call mpi_finalize while others call
> mpi_allreduce. I'm not really surprised this hangs, because I think the
> correct thing to replace STOP with is mpi_abort, not mpi_finalize. If you
> know where the STOP is called, you can check the preprocessed equivalent
> file (.f90 instead of .F) and see whether it has actually been replaced
> with a call to m_exit. I'm planning to test whether replacing m_exit with
> m_stop in symbol.inc gives more sensible behavior, i.e. program
> termination when the original source file executes a STOP.
>
> I'm assuming that a mix of mpi_allreduce and mpi_finalize really is
> expected to hang, but just in case that's surprising, here are my stack
> traces:
>
> Hung in the collective:
>
> (gdb) where
> #0  0x00002b8d5a095ec6 in opal_progress () from
>     /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.20
> #1  0x00002b8d59b3a36d in ompi_request_default_wait_all () from
>     /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #2  0x00002b8d59b8107c in ompi_coll_base_allreduce_intra_recursivedoubling ()
>     from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #3  0x00002b8d59b495ac in PMPI_Allreduce () from
>     /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #4  0x00002b8d598e4027 in pmpi_allreduce__ () from
>     /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
> #5  0x0000000000414077 in m_sum_i (comm=..., ivec=warning: Range for type
>     (null) has invalid bounds 1..-12884901892
>     [same warning repeated several times]
>     ..., n=2) at mpi.F:989
> #6  0x0000000000daac54 in full_kpoints::set_indpw_full (grid=..., wdes=...,
>     kpoints_f=...)
>     at mkpoints_full.F:1099
> #7  0x0000000001441654 in set_indpw_fock (t_info=..., p=warning: Range for
>     type (null) has invalid bounds 1..-1
>     [same warning repeated several times]
>     ..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at
>     address 0x1
>     ) at fock.F:1669
> #8  fock::setup_fock (t_info=..., p=warning: Range for type (null) has
>     invalid bounds 1..-1
>     [same warning repeated several times]
>     ..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at
>     address 0x1
>     ) at fock.F:1413
> #9  0x0000000002976478 in vamp () at main.F:2093
> #10 0x0000000000412f9e in main ()
> #11 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
> #12 0x0000000000412ea9 in _start ()
>
>
> Hung in mpi_finalize:
>
> #0  0x000000383a4acbdd in nanosleep () from /lib64/libc.so.6
> #1  0x000000383a4e1d94 in usleep () from /lib64/libc.so.6
> #2  0x00002b11db1e0ae7 in ompi_mpi_finalize () from
>     /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #3  0x00002b11daf8b399 in pmpi_finalize__ () from
>     /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
> #4  0x00000000004199c5 in m_exit () at mpi.F:375
> #5  0x0000000000dab17f in full_kpoints::set_indpw_full (grid=...,
>     wdes=Cannot resolve DW_OP_push_object_address for a missing object
>     ) at mkpoints_full.F:1065
> #6  0x0000000001441654 in set_indpw_fock (t_info=..., p=Cannot resolve
>     DW_OP_push_object_address for a missing object
>     ) at fock.F:1669
> #7  fock::setup_fock (t_info=..., p=Cannot resolve DW_OP_push_object_address
>     for a missing object
>     ) at fock.F:1413
> #8  0x0000000002976478 in vamp () at main.F:2093
> #9  0x0000000000412f9e in main ()
> #10 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
> #11 0x0000000000412ea9 in _start ()
>
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil
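As promised above, here is a rough sketch of the kind of STOP replacement Noam describes and of an MPI_Abort-based alternative. The exact #define in VASP's symbol.inc and the wrapper in mpi.F may well differ in detail, and m_stop_abort is just a hypothetical name used for illustration:

! What the symbol.inc trick presumably amounts to: with MPI enabled the
! preprocessor rewrites a plain Fortran STOP into a call to m_exit(), which
! wraps MPI_Finalize. MPI_Finalize is itself collective and, as the second
! trace above shows, ends up waiting for the other ranks, hence the deadlock
! while they are still inside MPI_Allreduce.
!
!   #define STOP  CALL m_exit(); stop
!
! A termination routine that actually ends the whole job would call
! MPI_Abort instead, so no peer is left blocked in a collective:

subroutine m_stop_abort(msg)
  use mpi
  implicit none
  character(len=*), intent(in) :: msg
  integer :: ierr
  write (*,*) 'aborting: ', trim(msg)
  ! MPI_Abort asks the runtime to terminate all tasks of the job associated
  ! with the communicator, which is what a STOP should do in a parallel run.
  call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
end subroutine m_stop_abort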