Hello,

in our case libwannier.a is a third-party library which is built
separately and then just linked in, so the VASP preprocessor never
touches it. As far as I can see, no preprocessing of the f90 source is
involved in the libwannier build process.
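For reference, a sketch of the STOP-rewriting define Noam describes
below; the shape is an assumption on my part, I have not checked the
exact wording in our symbol.inc:

    ! sketch only; the actual define in symbol.inc may differ
    #ifdef MPI
    #define STOP CALL M_exit(); stop
    #endif

With something like this active, a plain

    STOP

in a .F file becomes

    CALL M_exit(); stop

in the preprocessed .f90, i.e. an mpi_finalize on that rank alone,
whereas the f90 sources compiled into libwannier.a keep their true
Fortran STOPs (hence the for_stop_core in the backtrace below).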
I finally managed to set a breakpoint at the program exit of the root
rank:

(gdb) bt
#0  0x00002b7ccd2e4220 in _exit () from /lib64/libc.so.6
#1  0x00002b7ccd25ee2b in __run_exit_handlers () from /lib64/libc.so.6
#2  0x00002b7ccd25eeb5 in exit () from /lib64/libc.so.6
#3  0x000000000407298d in for_stop_core ()
#4  0x00000000012fad41 in w90_io_mp_io_error_ ()
#5  0x0000000001302147 in w90_parameters_mp_param_read_ ()
#6  0x00000000012f49c6 in wannier_setup_ ()
#7  0x0000000000e166a8 in mlwf_mp_mlwf_wannier90_ ()
#8  0x00000000004319ff in vamp () at main.F:2640
#9  0x000000000040d21e in main ()
#10 0x00002b7ccd247b15 in __libc_start_main () from /lib64/libc.so.6
#11 0x000000000040d129 in _start ()

So for_stop_core is apparently called? Of course it is below the
main() process of VASP, so additional things might happen which are
not visible. Is SIGCHLD (as observed when catching signals in mpirun)
the signal expected after a for_stop_core?

Thank you very much for investigating this!

Cheers

Christof

On Thu, Dec 08, 2016 at 03:15:47PM -0500, Noam Bernstein wrote:
> > On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet
> > <gilles.gouaillar...@gmail.com> wrote:
> >
> > Christof,
> >
> > There is something really odd with this stack trace.
> > count is zero, and some pointers do not point to valid addresses (!)
> >
> > In Open MPI, MPI_Allreduce(..., count=0, ...) is a no-op, so that
> > suggests that the stack has been corrupted inside MPI_Allreduce(),
> > or that you are not using the library you think you use.
> > pmap <pid> will show you which lib is used.
> >
> > BTW, this was not started with
> > mpirun --mca coll ^tuned ...
> > right?
> >
> > Just to make it clear ...
> > A task from your program bluntly issues a Fortran STOP, and this is
> > kind of a feature.
> > The *only* issue is that mpirun does not kill the other MPI tasks
> > and mpirun never completes.
> > Did I get it right?
> 
> I just ran across very similar behavior in VASP (which we just
> switched over to Open MPI 2.0.1), also in an allreduce + STOP
> combination (some nodes call one, others call the other), and I
> discovered several interesting things.
> 
> The most important is that when MPI is active, the preprocessor
> converts (via a #define in symbol.inc) Fortran STOP into calls to
> m_exit() (defined in mpi.F), which is a wrapper around mpi_finalize.
> So in my case some processes in the communicator call mpi_finalize,
> others call mpi_allreduce. I'm not really surprised this hangs,
> because I think the correct thing to replace STOP with is mpi_abort,
> not mpi_finalize. If you know where the STOP is called, you can check
> the preprocessed equivalent file (.f90 instead of .F) and see if it's
> actually been replaced with a call to m_exit. I'm planning to test
> whether replacing m_exit with m_stop in symbol.inc gives more
> sensible behavior, i.e. program termination when the original source
> file executes a STOP.
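To make the mismatch Noam describes concrete, here is a minimal
standalone sketch (my own toy code, not VASP): rank 0 leaves through
MPI_Finalize, as an m_exit()-style STOP replacement would, while all
other ranks enter MPI_Allreduce. With the commented MPI_Abort line
swapped in instead, the whole job terminates rather than hanging:

    ! finalize_vs_allreduce.f90 -- toy reproducer, not VASP code.
    ! Build with mpifort, run with mpirun -np 2 (or more).
    program finalize_vs_allreduce
      use mpi
      implicit none
      integer :: rank, ierr, sendval, recvval

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      if (rank == 0) then
         ! what the m_exit()-style STOP replacement effectively does:
         call MPI_Finalize(ierr)
         ! call MPI_Abort(MPI_COMM_WORLD, 1, ierr)  ! mpi_abort variant
         stop
      end if

      ! every other rank blocks here waiting for rank 0
      sendval = rank
      call MPI_Allreduce(sendval, recvval, 1, MPI_INTEGER, MPI_SUM, &
                         MPI_COMM_WORLD, ierr)

      call MPI_Finalize(ierr)
    end program finalize_vs_allreduce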
> I'm assuming that a mix of mpi_allreduce and mpi_finalize is really
> expected to hang, but just in case that's surprising, here are my
> stack traces:
> 
> hung in collective:
> 
> (gdb) where
> #0  0x00002b8d5a095ec6 in opal_progress () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.20
> #1  0x00002b8d59b3a36d in ompi_request_default_wait_all () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #2  0x00002b8d59b8107c in
> ompi_coll_base_allreduce_intra_recursivedoubling () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #3  0x00002b8d59b495ac in PMPI_Allreduce () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #4  0x00002b8d598e4027 in pmpi_allreduce__ () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
> #5  0x0000000000414077 in m_sum_i (comm=..., ivec=..., n=2)
> at mpi.F:989
>     [gdb repeated "warning: Range for type (null) has invalid bounds
>     1..-12884901892" seven times while printing ivec]
> #6  0x0000000000daac54 in full_kpoints::set_indpw_full (grid=...,
> wdes=..., kpoints_f=...) at mkpoints_full.F:1099
> #7  0x0000000001441654 in set_indpw_fock (t_info=..., p=...,
> wdes=..., grid=..., latt_cur=..., lmdim=<Cannot access memory at
> address 0x1>) at fock.F:1669
>     [gdb repeated "warning: Range for type (null) has invalid bounds
>     1..-1" seven times while printing p]
> #8  fock::setup_fock (t_info=..., p=..., wdes=..., grid=...,
> latt_cur=..., lmdim=<Cannot access memory at address 0x1>)
> at fock.F:1413
>     [same warnings for p as in frame #7]
> #9  0x0000000002976478 in vamp () at main.F:2093
> #10 0x0000000000412f9e in main ()
> #11 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
> #12 0x0000000000412ea9 in _start ()
> 
> hung in mpi_finalize:
> 
> #0  0x000000383a4acbdd in nanosleep () from /lib64/libc.so.6
> #1  0x000000383a4e1d94 in usleep () from /lib64/libc.so.6
> #2  0x00002b11db1e0ae7 in ompi_mpi_finalize () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #3  0x00002b11daf8b399 in pmpi_finalize__ () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
> #4  0x00000000004199c5 in m_exit () at mpi.F:375
> #5  0x0000000000dab17f in full_kpoints::set_indpw_full (grid=...,
> wdes=<Cannot resolve DW_OP_push_object_address for a missing object>)
> at mkpoints_full.F:1065
> #6  0x0000000001441654 in set_indpw_fock (t_info=..., p=<Cannot
> resolve DW_OP_push_object_address for a missing object>)
> at fock.F:1669
> #7  fock::setup_fock (t_info=..., p=<Cannot resolve
> DW_OP_push_object_address for a missing object>) at fock.F:1413
> #8  0x0000000002976478 in vamp () at main.F:2093
> #9  0x0000000000412f9e in main ()
> #10 0x000000383a41ed1d in __libc_start_main () from /lib64/libc.so.6
> #11 0x0000000000412ea9 in _start ()
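Regarding the m_exit -> m_stop replacement Noam mentions above: I
would expect an abort-based wrapper of roughly the following shape
(sketch only; the name and the real m_stop implementation in VASP's
mpi.F are assumptions on my part, not checked):

    ! sketch of an MPI_Abort-based STOP replacement, not the actual
    ! m_stop from VASP's mpi.F
    subroutine my_m_stop()
      use mpi
      implicit none
      integer :: ierr
      ! tears down all ranks of the job instead of finalizing only
      ! the calling one
      call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
    end subroutine my_m_stop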
> 
> ____________
> |U.S. NAVAL|
> |_RESEARCH_|
> LABORATORY
> 
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil

-- 
Dr. rer. nat. Christof Köhler    email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen / BCCMS      phone: +49-(0)421-218-62334
Am Fallturm 1 / TAB / Raum 3.12  fax:   +49-(0)421-218-62770
28359 Bremen
PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/