Folks,

The problem is indeed pretty trivial to reproduce.

I opened https://github.com/open-mpi/ompi/issues/2550 (and included a
reproducer).


Cheers,

Gilles

On Fri, Dec 9, 2016 at 5:15 AM, Noam Bernstein
<noam.bernst...@nrl.navy.mil> wrote:
> On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
> Christof,
>
>
> There is something really odd with this stack trace:
> the count argument is zero, and some pointers do not point to valid addresses (!)
>
> In Open MPI, MPI_Allreduce(..., count=0, ...) is a no-op, so that suggests that
> the stack has been corrupted inside MPI_Allreduce(), or that you are not
> using the library you think you are using.
> pmap <pid> will show you which lib is actually loaded.
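> A trivial sketch of that count=0 case (my illustration, not from the original
> report), assuming a standard MPI Fortran build: an allreduce with count=0 should
> return immediately on every rank, so a hang at that call points at stack
> corruption or a mismatched library rather than the collective itself.
>
>    program zero_count_allreduce
>       ! illustrative only: a count=0 MPI_Allreduce is a no-op and should
>       ! return immediately on all ranks
>       use mpi
>       implicit none
>       integer :: ierr, sendbuf(1), recvbuf(1)
>       call MPI_INIT(ierr)
>       ! count = 0: nothing is read from sendbuf or written to recvbuf
>       call MPI_ALLREDUCE(sendbuf, recvbuf, 0, MPI_INTEGER, MPI_SUM, &
>                          MPI_COMM_WORLD, ierr)
>       call MPI_FINALIZE(ierr)
>    end program zero_count_allreduce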
>
> BTW, this was not started with
> mpirun --mca coll ^tuned ...
> right?
>
> Just to make it clear:
> a task from your program bluntly issues a Fortran STOP, and this is kind of
> a feature.
> The *only* issue is that mpirun does not kill the other MPI tasks, and mpirun
> never completes.
> Did I get that right?
>
>
> I just ran across very similar behavior in VASP (which we just switched over
> to Open MPI 2.0.1), also in an allreduce + STOP combination (some ranks call
> one, others call the other), and I discovered several interesting things.
>
> The most important thing is that when MPI is active, the preprocessor converts
> (via a #define in symbol.inc) a Fortran STOP into a call to m_exit() (defined
> in mpi.F), which is a wrapper around mpi_finalize.  So in my case some
> processes in the communicator call mpi_finalize while others call mpi_allreduce.
> I’m not really surprised this hangs, because I think the correct thing to
> replace STOP with is mpi_abort, not mpi_finalize.  If you know where the
> STOP is called, you can check the preprocessed equivalent file (.f90 instead
> of .F) and see whether it has actually been replaced with a call to m_exit.  I’m
> planning to test whether replacing m_exit with m_stop in symbol.inc gives
> more sensible behavior, i.e. program termination when the original source
> file executes a STOP.
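>
> A minimal sketch of the mismatch described above (my illustration, not the
> actual VASP code and not the reproducer attached to the GitHub issue): rank 0
> takes the m_exit()-style path and calls mpi_finalize, while the remaining
> ranks enter an allreduce and wait forever.
>
>    program finalize_vs_allreduce
>       ! illustrative only: one rank leaves via MPI_FINALIZE (as the
>       ! preprocessed STOP -> m_exit() path does) while the others are
>       ! still inside a collective, so the collective never completes
>       use mpi
>       implicit none
>       integer :: ierr, rank, ival, isum
>       call MPI_INIT(ierr)
>       call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
>       if (rank == 0) then
>          ! what the converted STOP effectively does; calling
>          ! MPI_ABORT(MPI_COMM_WORLD, 1, ierr) here instead would
>          ! terminate the whole job
>          call MPI_FINALIZE(ierr)
>       else
>          ival = 1
>          ! these ranks never see rank 0 in the collective -> hang
>          call MPI_ALLREDUCE(ival, isum, 1, MPI_INTEGER, MPI_SUM, &
>                             MPI_COMM_WORLD, ierr)
>          call MPI_FINALIZE(ierr)
>       end if
>    end program finalize_vs_allreduce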
>
> I’m assuming that a mix of mpi_allreduce and mpi_finalize is really expected
> to hang, but just in case that’s surprising, here are my stack traces:
>
>
> hung in collective:
>
> (gdb) where
>
> #0  0x00002b8d5a095ec6 in opal_progress () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.20
> #1  0x00002b8d59b3a36d in ompi_request_default_wait_all () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #2  0x00002b8d59b8107c in ompi_coll_base_allreduce_intra_recursivedoubling
> () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #3  0x00002b8d59b495ac in PMPI_Allreduce () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #4  0x00002b8d598e4027 in pmpi_allreduce__ () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
> #5  0x0000000000414077 in m_sum_i (comm=..., ivec=warning: Range for type
> (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> ..., n=2) at mpi.F:989
> #6  0x0000000000daac54 in full_kpoints::set_indpw_full (grid=..., wdes=...,
> kpoints_f=...) at mkpoints_full.F:1099
> #7  0x0000000001441654 in set_indpw_fock (t_info=..., p=warning: Range for
> type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> ..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address
> 0x1
> ) at fock.F:1669
> #8  fock::setup_fock (t_info=..., p=warning: Range for type (null) has
> invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> ..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address
> 0x1
> ) at fock.F:1413
> #9  0x0000000002976478 in vamp () at main.F:2093
> #10 0x0000000000412f9e in main ()
> #11 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
> #12 0x0000000000412ea9 in _start ()
>
>
> hung in mpi_finalize:
>
> #0  0x000000383a4acbdd in nanosleep () from /lib64/libc.so.6
> #1  0x000000383a4e1d94 in usleep () from /lib64/libc.so.6
> #2  0x00002b11db1e0ae7 in ompi_mpi_finalize () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #3  0x00002b11daf8b399 in pmpi_finalize__ () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
> #4  0x00000000004199c5 in m_exit () at mpi.F:375
> #5  0x0000000000dab17f in full_kpoints::set_indpw_full (grid=...,
> wdes=Cannot resolve DW_OP_push_object_address for a missing object
> ) at mkpoints_full.F:1065
> #6  0x0000000001441654 in set_indpw_fock (t_info=..., p=Cannot resolve
> DW_OP_push_object_address for a missing object
> ) at fock.F:1669
> #7  fock::setup_fock (t_info=..., p=Cannot resolve DW_OP_push_object_address
> for a missing object
> ) at fock.F:1413
> #8  0x0000000002976478 in vamp () at main.F:2093
> #9  0x0000000000412f9e in main ()
> #10 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
> #11 0x0000000000412ea9 in _start ()
>
>
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil
>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
