Christof,

There is something really odd with this stack trace.
count is zero, and some pointers do not point to valid addresses (!)

In Open MPI, MPI_Allreduce(..., count=0, ...) is a no-op, so that suggests
that the stack has been corrupted inside MPI_Allreduce(), or that you are not
using the library you think you are using.
pmap <pid> will show you which library is actually used.
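For example (assuming <pid> is the pid of one of the hung ranks), something
like

    pmap <pid> | grep libmpi

should list the libmpi that is actually mapped into the process.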

By the way, this was not started with
mpirun --mca coll ^tuned ...
right?

Just to make it clear: a task from your program bluntly issues a Fortran
STOP, and that is kind of a feature.
The *only* issue is that mpirun does not kill the other MPI tasks and mpirun
never completes.
Did I get that right?
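
For what it is worth, an (untested) minimal reproducer sketch of that pattern,
with rank 0 issuing a bare STOP while the other ranks enter an allreduce,
would be something like

    program stop_hang
      use mpi
      implicit none
      integer :: ierr, rank, sendval, recvval
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      ! rank 0 terminates without MPI_Finalize/MPI_Abort, like the wannier90 stop
      if (rank == 0) stop "simulated error"
      sendval = rank
      ! the surviving ranks block here waiting for rank 0
      call MPI_Allreduce(sendval, recvval, 1, MPI_INTEGER, MPI_SUM, &
                         MPI_COMM_WORLD, ierr)
      call MPI_Finalize(ierr)
    end program stop_hang

If mpirun leaves the other ranks running after rank 0 stops, that would match
what you describe.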

Cheers,

Gilles

On Thursday, December 8, 2016, Christof Koehler <christof.koeh...@bccms.uni-bremen.de> wrote:

> Hello everybody,
>
> I tried it with the nightly and with the 2.0.2 branch directly from git,
> which according to the log should contain that patch:
>
> commit d0b97d7a408b87425ca53523de369da405358ba2
> Merge: ac8c019 b9420bb
> Author: Jeff Squyres <jsquy...@users.noreply.github.com>
> Date:   Wed Dec 7 18:24:46 2016 -0500
>     Merge pull request #2528 from rhc54/cmr20x/signals
>
> Unfortunately it changes nothing. The root rank stops, and all other
> ranks (and mpirun) just stay, the remaining ranks at 100 % CPU, apparently
> waiting in that allreduce. The stack trace looks a bit more interesting
> (is a git build always a debug build?), so I include it at the very
> bottom just in case.
>
> Off-list, Gilles Gouaillardet suggested setting breakpoints at exit,
> __exit etc. to try to catch signals. Would that be useful? I need a
> moment to figure out how to do this, but I can definitely try.
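>
> My guess at how to do this (untested, so please correct me if that is not
> what you had in mind) would be to attach gdb to one of the ranks, e.g.
>
>     gdb -p <pid>
>     (gdb) break exit
>     (gdb) break _exit
>     (gdb) handle SIGTERM stop print
>     (gdb) continue
>
> and then see what arrives at that rank when the root rank stops.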
>
> One remark: during "make install" from the git repo I see a
>
> WARNING!  Common symbols found:
>           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2complex
>           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_complex
>           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2double_precision
>           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2integer
>           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_2real
>           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_aint
>           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_band
>           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bor
>           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_bxor
>           mpi-f08-types.o: 0000000000000004 C ompi_f08_mpi_byte
>
> I have never noticed this before.
>
>
> Best Regards
>
> Christof
>
> Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
> #0  0x00002af84e4c669d in poll () from /lib64/libc.so.6
> #1  0x00002af850517496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/
> intel2016/lib/libopen-pal.so.20
> #2  0x00002af85050ffa5 in opal_libevent2022_event_base_loop () from
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> #3  0x00002af85049fa1f in opal_progress () at runtime/opal_progress.c:207
> #4  0x00002af84e02f7f7 in ompi_request_default_wait_all (count=233618144,
> requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
> #5  0x00002af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling
> (sbuf=0xdecbae0,
> rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0, comm=0x1,
> module=0xdee69e0) at base/coll_base_allreduce.c:225
> #6  0x00002af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed
> (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0xffffffffffffffff, op=0x0,
> comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66
> #7  0x00002af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2,
> count=0, datatype=0xffffffffffffffff, op=0x0, comm=0x1) at pallreduce.c:107
> #8  0x00002af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005",
> recvbuf=0x2 <Address 0x2 out of bounds>, count=0x0,
> datatype=0xffffffffffffffff, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at
> pallreduce_f.c:87
> #9  0x000000000045ecc6 in m_sum_i_ ()
> #10 0x0000000000e172c9 in mlwf_mp_mlwf_wannier90_ ()
> #11 0x00000000004325ff in vamp () at main.F:2640
> #12 0x000000000040de1e in main ()
> #13 0x00002af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
> #14 0x000000000040dd29 in _start ()
>
> On Wed, Dec 07, 2016 at 09:47:48AM -0800, r...@open-mpi.org wrote:
> > Hi Christof
> >
> > Sorry if I missed this, but it sounds like you are saying that one of
> your procs abnormally terminates, and we are failing to kill the remaining
> job? Is that correct?
> >
> > If so, I just did some work that might relate to that problem; it is
> > pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528
> >
> > Would you be able to try that?
> >
> > Ralph
> >
> > > On Dec 7, 2016, at 9:37 AM, Christof Koehler <christof.koeh...@bccms.uni-bremen.de> wrote:
> > >
> > > Hello,
> > >
> > > On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> > >>> On Dec 7, 2016, at 10:07 AM, Christof Koehler <christof.koeh...@bccms.uni-bremen.de> wrote:
> > >>>>
> > >>> I really think the hang is a consequence of unclean termination (in
> > >>> the sense that the non-root ranks are not terminated) and probably
> > >>> not the cause, in my interpretation of what I see. Would you have any
> > >>> suggestion to catch signals sent between orterun (mpirun) and the
> > >>> child tasks?
> > >>
> > >> Do you know where in the code the termination call is?  Is it actually
> > >> calling mpi_abort(), or just doing something ugly like calling Fortran
> > >> “stop”?  If the latter, would that explain a possible hang?
> > > Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The
> > > wannier90 input contains an error: a restart is requested, but the
> > > wannier90.chk file with the restart information is missing.
> > > "
> > > Exiting.......
> > > Error: restart requested but wannier90.chk file not found
> > > "
> > > So it must terminate.
> > >
> > > The termination happens in libwannier.a, source file io.F90:
> > >
> > > write(stdout,*)  'Exiting.......'
> > > write(stdout, '(1x,a)') trim(error_msg)
> > > close(stdout)
> > > stop "wannier90 error: examine the output/error file for details"
> > >
> > > So it calls stop, as you assumed.
> > >
> > >> Presumably someone here can comment on what the standard says about
> > >> the validity of terminating without mpi_abort.
> > >
> > > Well, probably stop is not a good way to terminate then.
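> > >
> > > Just as a sketch (I have not tried patching libwannier), I suppose the
> > > cleaner replacement for that stop in io.F90 would be something along
> > > the lines of
> > >
> > >     integer :: ierr
> > >     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
> > >
> > > assuming MPI is initialized and the mpi module (or mpif.h) is available
> > > at that point.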
> > >
> > > My main point was the change in behaviour relative to 1.10 anyway :-)
> > >
> > >
> > >>
> > >> Actually, if you’re willing to share enough input files to reproduce,
> > >> I could take a look.  I just recompiled our VASP with openmpi 2.0.1 to
> > >> fix a crash that was apparently addressed by some change in the memory
> > >> allocator in a recent version of openmpi.  Just e-mail me if that’s
> > >> the case.
> > >
> > > I think that is no longer necessary? In principle it is no problem, but
> > > it is at the end of a (small) GW calculation, the Si tutorial example,
> > > so the mail would be a bit larger due to the WAVECAR.
> > >
> > >
> > >>
> > >>
> > >> Noam
> > >>
> > >>
> > >> ____________
> > >> ||
> > >> |U.S. NAVAL|
> > >> |_RESEARCH_|
> > >> LABORATORY
> > >> Noam Bernstein, Ph.D.
> > >> Center for Materials Physics and Technology
> > >> U.S. Naval Research Laboratory
> > >> T +1 202 404 8628  F +1 202 404 7546
> > >> https://www.nrl.navy.mil
> > >
> > > --
> > > Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> > > Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> > > Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> > > 28359 Bremen
> > >
> > > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> > > _______________________________________________
> > > users mailing list
> > > users@lists.open-mpi.org
> > > https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> >
>
> --
> Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> 28359 Bremen
>
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
