Hello,

On Wed, Dec 07, 2016 at 11:07:49PM +0900, Gilles Gouaillardet wrote:
> Christof,
> 
> out of curiosity, can you run
> dmesg
> and see if you find some tasks killed by the oom-killer ?
Definitely not the oom-killer. It is a really tiny example. I checked
the machine's log file and dmesg.
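
For completeness, a sketch of what I checked (the /var/log/messages path is
how our nodes are set up; it may differ elsewhere):

dmesg | grep -i -E 'oom|killed process'
grep -i oom /var/log/messages

Neither shows anything suspicious.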

> 
> the error message you see is a consequence of a task that died unexpectedly,
> and there is no evidence the task crashed or was killed.
Yes, confusing, isn't it?

> 
> when you observe a hang with two tasks, you can
> - retrieve the pids with ps
> - run 'pstack <pid>' on both pids in order to collect the stacktrace.
When it hangs, one task is already gone! The pstack traces I sent are from the
survivor(s). It is not terminating completely as it should.

> 
> assuming they both hang in MPI_Allreduce(), the relevant parts for us are
> - the datatype (MPI_INT)
> - the count (n)
> - the communicator (COMM%MPI_COMM) (size; check this is the same communicator
> used by all tasks)
> - whether the whole buffer (ivec(1:n)) is accessible

As I said, the root rank terminates (normally, according to gdb). The other
rank remains and hangs in allreduce, possibly because its partner (the
root rank) went away without saying goodbye properly.

This is not a real hang IMO, but a failure to terminate all ranks cleanly.

In my interpretation of what I see, the hang is a consequence of the
unclean termination (in the sense that the non-root ranks are not
terminated) and probably not its cause. Would you have any suggestion for
catching signals sent between orterun (mpirun) and the child tasks?
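
One thing I could try myself is watching the signal traffic with strace (a
sketch; it assumes strace is available on the node, that I may ptrace the
processes, and the PIDs are placeholders):

strace -f -e trace=signal -p <pid of mpirun/orterun>
strace -e trace=signal -p <pid of a surviving rank>

That should at least show which signals get delivered to whom while the job
is shutting down.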

I will try to get the information you want, but I will have to figure
out how to do that first. 
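
For collecting the datatype/count/communicator information, a rough sketch of
what I plan to try in gdb, assuming the m_sum_i_ frame is visible and the
debug information is good enough (frame number and element count are
placeholders):

gdb -p <pid of the hanging rank>
(gdb) bt
(gdb) frame <number of the m_sum_i_ frame>
(gdb) print n
(gdb) print ivec(1)
(gdb) print ivec(1)@16

Getting the communicator size out of the Fortran handle is probably harder;
I may just add a temporary MPI_COMM_SIZE call and a print statement around
the MPI_ALLREDUCE instead.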

Cheers

Christof

> 
> Cheers,
> 
> Gilles
> 
> On Wednesday, December 7, 2016, Christof Koehler <
> christof.koeh...@bccms.uni-bremen.de> wrote:
> 
> > Hello,
> >
> > thank you for the fast answer.
> >
> > On Wed, Dec 07, 2016 at 08:23:43PM +0900, Gilles Gouaillardet wrote:
> > > Christoph,
> > >
> > > can you please try again with
> > >
> > > mpirun --mca btl tcp,self --mca pml ob1 ...
> >
> > mpirun -n 20 --mca btl tcp,self --mca pml ob1
> > /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi
> >
> > Deadlocks/hangs; no effect.
> >
> > > mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...
> > mpirun -n 20 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned
> > /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi
> >
> > Deadlocks/hangs; no effect, but there is additional output.
> >
> > wannier90 error: examine the output/error file for details
> > [node109][[55572,1],16][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> > [node109][[55572,1],8][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> > [node109][[55572,1],4][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> > [node109][[55572,1],1][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> > [node109][[55572,1],2][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> >
> > Please note: the "wannier90 error: examine the output/error file for
> > details" message is expected; there is in fact an error in the input file
> > and the program is supposed to terminate.
> >
> > However, with mvapich2 and openmpi 1.10.4 it terminates
> > completely, i.e. I get my shell prompt back. Whether a segfault is also
> > involved with mvapich2 (as is apparently the case with openmpi 1.10.4,
> > based on the termination message) I do not know. I tried
> >
> > export MV2_DEBUG_SHOW_BACKTRACE=1
> > mpirun -n 20  /cluster/vasp/5.3.5/intel2016/mvapich2-2.2/bin/vasp-mpi
> >
> > but did not get any indication of a problem (segfault), the last lines
> > are
> >
> >  calculate QP shifts <psi_nk| G(iteration)W_0 |psi_nk>: iteration 1
> >  writing wavefunctions
> > wannier90 error: examine the output/error file for details
> > node109 14:00 /scratch/ckoe/gw %
> >
> > The last line is my shell prompt.
> >
> > >
> > > if everything fails, can you describe how MPI_Allreduce is invoked ?
> > > /* number of tasks, datatype, number of elements */
> > That is difficult; this is not our code in the first place [1] and the
> > problem occurs when using an ("officially" supported) third-party library [2].
> >
> > From the stack trace of the hanging process, the vasp routine which calls
> > allreduce is "m_sum_i_" in the mpi.F source file. Allreduce is
> > called as
> >
> > CALL MPI_ALLREDUCE( MPI_IN_PLACE, ivec(1), n, MPI_INTEGER, &
> >          &                MPI_SUM, COMM%MPI_COMM, ierror )
> >
> > n and ivec are of data type integer. It originally ran with 20 ranks; I
> > tried 2 ranks now as well and it hangs, too. With one (!) rank
> >
> > mpirun -n 1 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned
> > /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi
> >
> > I of course get a shell prompt back.
> >
> > I then started it normally in the shell with 2 ranks,
> > mpirun -n 2 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned
> > /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi
> > and attached gdb to the rank with the lowest pid (3478). I do not get a
> > prompt back (it hangs), the second rank (3479) is still at 100 % CPU, and
> > mpirun is still a process I can see with "ps", but gdb says
> > (gdb) continue     <- that is where I attached it!
> > Continuing.
> > [Thread 0x2b8366806700 (LWP 3480) exited]
> > [Thread 0x2b835da1c040 (LWP 3478) exited]
> > [Inferior 1 (process 3478) exited normally]
> > (gdb) bt
> > No stack.
> >
> > So, as far as gdb is concerned, the rank with the lowest pid (which is
> > gone while the other rank is still eating CPU time) terminated normally?
> >
> > I hope this helps. I have only very basic experience with debuggers
> > (never really needed them) and even less with using them in parallel.
> > I can try to capture the contents of ivec, but I do not think that would
> > be helpful. If you need them I can try, of course; I have no idea how
> > large the vector is.
> >
> >
> > Best Regards
> >
> > Christof
> >
> > [1] https://www.vasp.at/
> > [2] http://www.wannier.org/, Old version 1.2
> > >
> > >
> > >
> > > Cheers,
> > >
> > > Gilles
> > >
> > > On Wed, Dec 7, 2016 at 7:38 PM, Christof Koehler
> > > <christof.koeh...@bccms.uni-bremen.de> wrote:
> > > > Hello everybody,
> > > >
> > > > I am observing a deadlock in allreduce with openmpi 2.0.1 on a single
> > > > node. A stack trace (pstack) of one rank is below, showing the program
> > > > (vasp 5.3.5) and the two psm2 progress threads. However:
> > > >
> > > > In fact, the vasp input is not ok and it should abort at the point
> > > > where it hangs. It does when using mvapich 2.2. With openmpi 2.0.1 it
> > > > just deadlocks in some allreduce operation. Originally it was started
> > > > with 20 ranks; when it hangs there are only 19 left. From the PIDs I
> > > > would assume it is the master rank which is missing. So, this looks
> > > > like a failure to terminate.
> > > >
> > > > With 1.10 I get a clean
> > > > --------------------------------------------------------------------------
> > > > mpiexec noticed that process rank 0 with PID 18789 on node node109
> > > > exited on signal 11 (Segmentation fault).
> > > > --------------------------------------------------------------------------
> > > >
> > > > Any ideas what to try? Of course, in this situation it may well be the
> > > > program. Still, with the observed difference between 2.0.1 and 1.10
> > > > (and mvapich) this might be interesting to someone.
> > > >
> > > > Best Regards
> > > >
> > > > Christof
> > > >
> > > >
> > > > Thread 3 (Thread 0x2ad362577700 (LWP 4629)):
> > > > #0  0x00002ad35b1562c3 in epoll_wait () from /lib64/libc.so.6
> > > > #1  0x00002ad35d114f42 in epoll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > > > #2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > > > #3  0x00002ad35d16e996 in progress_engine () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > > > #4  0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
> > > > #5  0x00002ad35b155ced in clone () from /lib64/libc.so.6
> > > > Thread 2 (Thread 0x2ad362778700 (LWP 4640)):
> > > > #0  0x00002ad35b14b69d in poll () from /lib64/libc.so.6
> > > > #1  0x00002ad35d11dc42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > > > #2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > > > #3  0x00002ad35d0c61d1 in progress_engine () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > > > #4  0x00002ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
> > > > #5  0x00002ad35b155ced in clone () from /lib64/libc.so.6
> > > > Thread 1 (Thread 0x2ad35978d040 (LWP 4609)):
> > > > #0  0x00002ad35b14b69d in poll () from /lib64/libc.so.6
> > > > #1  0x00002ad35d11dc42 in poll_dispatch () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > > > #2  0x00002ad35d116751 in opal_libevent2022_event_base_loop () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > > > #3  0x00002ad35d0c28cf in opal_progress () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> > > > #4  0x00002ad35adce8d8 in ompi_request_wait_completion () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> > > > #5  0x00002ad35adce838 in mca_pml_cm_recv () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> > > > #6  0x00002ad35ad4da42 in ompi_coll_base_allreduce_intra_recursivedoubling () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> > > > #7  0x00002ad35ad52906 in ompi_coll_tuned_allreduce_intra_dec_fixed () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> > > > #8  0x00002ad35ad1f0f4 in PMPI_Allreduce () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> > > > #9  0x00002ad35aa99c38 in pmpi_allreduce__ () from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi_mpifh.so.20
> > > > #10 0x000000000045f8c6 in m_sum_i_ ()
> > > > #11 0x0000000000e1ce69 in mlwf_mp_mlwf_wannier90_ ()
> > > > #12 0x00000000004331ff in vamp () at main.F:2640
> > > > #13 0x000000000040ea1e in main ()
> > > > #14 0x00002ad35b080b15 in __libc_start_main () from /lib64/libc.so.6
> > > > #15 0x000000000040e929 in _start ()
> > > >
> > > >
> > > > --
> > > > Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> > > > Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> > > > Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> > > > 28359 Bremen
> > > >
> > > > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> > > >
> > > > _______________________________________________
> > > > users mailing list
> > > > users@lists.open-mpi.org
> > > > https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> >
> > --
> > Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> > Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> > Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> > 28359 Bremen
> >
> > PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> >

-- 
Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
28359 Bremen  

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
