Christof,
There is something really odd with this stack trace: count is zero, and some
pointers do not point to valid addresses (!). In Open MPI,
MPI_Allreduce(..., count=0, ...) is a no-op, so that suggests either that the
stack has been corrupted inside MPI_Allreduce(), or that you are not using
the library you think you are using. pmap <pid> will show you which library
is actually loaded.

By the way, this was not started with "mpirun --mca coll ^tuned ...", right?

Just to make it clear: a task from your program bluntly issues a Fortran
STOP, and this is kind of a feature. The *only* issue is that mpirun does
not kill the other MPI tasks and never completes. Did I get that right?

Cheers,

Gilles

On Thursday, December 8, 2016, Christof Koehler <
christof.koeh...@bccms.uni-bremen.de> wrote:

> Hello everybody,
>
> I tried it with the nightly and the direct 2.0.2 branch from git, which
> according to the log should contain that patch:
>
> commit d0b97d7a408b87425ca53523de369da405358ba2
> Merge: ac8c019 b9420bb
> Author: Jeff Squyres <jsquy...@users.noreply.github.com>
> Date:   Wed Dec 7 18:24:46 2016 -0500
>
>     Merge pull request #2528 from rhc54/cmr20x/signals
>
> Unfortunately it changes nothing. The root rank stops, and all other
> ranks (and mpirun) just stay around, the remaining ranks at 100 % CPU,
> apparently waiting in that allreduce. The stack trace looks a bit more
> interesting (is a git build always a debug build?), so I include it at
> the very bottom just in case.
>
> Off-list, Gilles Gouaillardet suggested setting breakpoints at exit,
> __exit etc. to try to catch signals. Would that be useful? I need a
> moment to figure out how to do this, but I can definitely try.
>
> One remark: during "make install" from the git repo I see a
>
> WARNING!
> Common symbols found:
>   mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2complex
>   mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_complex
>   mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_precision
>   mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2integer
>   mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2real
>   mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_aint
>   mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_band
>   mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bor
>   mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bxor
>   mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_byte
>
> I have never noticed this before.
>
> Best Regards
>
> Christof
>
> Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
> #0  0x00002af84e4c669d in poll () from /lib64/libc.so.6
> #1  0x00002af850517496 in poll_dispatch () from
>     /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> #2  0x00002af85050ffa5 in opal_libevent2022_event_base_loop () from
>     /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> #3  0x00002af85049fa1f in opal_progress () at runtime/opal_progress.c:207
> #4  0x00002af84e02f7f7 in ompi_request_default_wait_all (count=233618144,
>     requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
> #5  0x00002af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling
>     (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0,
>     comm=0x1, module=0xdee69e0) at base/coll_base_allreduce.c:225
> #6  0x00002af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed
>     (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0,
>     comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66
> #7  0x00002af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2,
>     count=0, datatype=0xffffffffffffffff, op=0x0, comm=0x1)
>     at pallreduce.c:107
> #8  0x00002af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005",
>     recvbuf=0x2 <Address 0x2 out of bounds>, count=0x0,
>     datatype=0xffffffffffffffff, op=0x0, comm=0x1,
>     ierr=0x7ffdf3cffe9c) at pallreduce_f.c:87
> #9  0x000000000045ecc6 in m_sum_i_ ()
> #10 0x0000000000e172c9 in mlwf_mp_mlwf_wannier90_ ()
> #11 0x00000000004325ff in vamp () at main.F:2640
> #12 0x000000000040de1e in main ()
> #13 0x00002af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
> #14 0x000000000040dd29 in _start ()
>
> On Wed, Dec 07, 2016 at 09:47:48AM -0800, r...@open-mpi.org wrote:
> > Hi Christof
> >
> > Sorry if I missed this, but it sounds like you are saying that one of
> > your procs abnormally terminates, and we are failing to kill the
> > remaining job? Is that correct?
> >
> > If so, I just did some work that might relate to that problem; it is
> > pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528
> >
> > Would you be able to try that?
> >
> > Ralph
> >
> > > On Dec 7, 2016, at 9:37 AM, Christof Koehler <
> > > christof.koeh...@bccms.uni-bremen.de> wrote:
> > >
> > > Hello,
> > >
> > > On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> > >>> On Dec 7, 2016, at 10:07 AM, Christof Koehler <
> > >>> christof.koeh...@bccms.uni-bremen.de> wrote:
> > >>>
> > >>> I really think the hang is a consequence of unclean termination
> > >>> (in the sense that the non-root ranks are not terminated) and
> > >>> probably not the cause, in my interpretation of what I see. Would
> > >>> you have any suggestion for catching signals sent between orterun
> > >>> (mpirun) and the child tasks?
> > >>
> > >> Do you know where in the code the termination call is? Is it
> > >> actually calling mpi_abort(), or just doing something ugly like
> > >> calling Fortran "stop"? If the latter, would that explain a
> > >> possible hang?
> > > Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The
> > > wannier90 input contains an error: a restart is requested, but the
> > > wannier90.chk file with the restart information is missing.
> > > "
> > > Exiting.......
> > > Error: restart requested but wannier90.chk file not found
> > > "
> > > So it must terminate.
> > >
> > > The termination happens in libwannier.a, source file io.F90:
> > >
> > >   write(stdout,*) 'Exiting.......'
> > >   write(stdout, '(1x,a)') trim(error_msg)
> > >   close(stdout)
> > >   stop "wannier90 error: examine the output/error file for details"
> > >
> > > So it calls stop, as you assumed.
> > >
> > >> Presumably someone here can comment on what the standard says about
> > >> the validity of terminating without mpi_abort.
> > >
> > > Well, probably stop is not a good way to terminate then.
> > >
> > > My main point was the change relative to 1.10 anyway :-)
> > >
> > >> Actually, if you're willing to share enough input files to
> > >> reproduce, I could take a look. I just recompiled our VASP with
> > >> Open MPI 2.0.1 to fix a crash that was apparently addressed by some
> > >> change in the memory allocator in a recent version of Open MPI.
> > >> Just e-mail me if that's the case.
> > >
> > > I think that is no longer necessary? In principle it is no problem,
> > > but it is at the end of a (small) GW calculation, the Si tutorial
> > > example, so the mail would be a bit larger due to the WAVECAR file.
> > >
> > >> Noam
> > >>
> > >> U.S. NAVAL RESEARCH LABORATORY
> > >> Noam Bernstein, Ph.D.
> > >> Center for Materials Physics and Technology
> > >> U.S. Naval Research Laboratory
> > >> T +1 202 404 8628  F +1 202 404 7546
> > >> https://www.nrl.navy.mil
> > >
> > > --
> > > Dr. rer. nat.
> > > Christof Köhler              email: c.koeh...@bccms.uni-bremen.de
> > > Universitaet Bremen/ BCCMS   phone: +49-(0)421-218-62334
> > > Am Fallturm 1/ TAB/ Raum 3.12  fax: +49-(0)421-218-62770
> > > 28359 Bremen
> > >
> > > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> > > _______________________________________________
> > > users mailing list
> > > users@lists.open-mpi.org
> > > https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
> --
> Dr. rer. nat. Christof Köhler    email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS       phone: +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12    fax:   +49-(0)421-218-62770
> 28359 Bremen
>
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
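
For reference, this is roughly what the quoted io.F90 error handler would look like if it called MPI_Abort before stopping, so that the runtime kills all ranks instead of only the calling one. This is a sketch, not the actual wannier90 code: it assumes an MPI-enabled build in which MPI_Init has already been called, and the stdout unit is a placeholder (the real code takes it from wannier90's io module).

```fortran
subroutine io_error(error_msg)
  ! Sketch only (not the actual wannier90 source): assumes an
  ! MPI-enabled build and that MPI_Init has been called before the
  ! error occurs.
  use mpi
  implicit none
  character(len=*), intent(in) :: error_msg
  integer :: ierr
  integer, parameter :: stdout = 6  ! placeholder; wannier90 uses its io module's unit

  write (stdout, *) 'Exiting.......'
  write (stdout, '(1x,a)') trim(error_msg)
  close (stdout)

  ! MPI_Abort asks the runtime to terminate *all* ranks of the
  ! communicator, so mpirun exits instead of leaving the surviving
  ! ranks spinning inside a collective the stopped rank never enters.
  call mpi_abort(MPI_COMM_WORLD, 1, ierr)

  ! Fallback in case MPI is not initialized; a bare STOP only ends
  ! the calling rank.
  stop 'wannier90 error: examine the output/error file for details'
end subroutine io_error
```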
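
The two diagnostics Gilles suggests near the top of the thread would look something like the following; the pid and the binary name are placeholders, not values from this thread.

```
# Which MPI library has a running rank actually mapped?
# (replace 12345 with the pid of one of the spinning ranks)
pmap 12345 | grep -i mpi

# Re-run with the "tuned" collective component disabled, to rule out
# the tuned allreduce implementation seen in the stack trace
mpirun --mca coll ^tuned -np 4 ./vasp_std
```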