Greetings!
We had some problems in our collectives with optimizations that were
added in the last month or so, and we just noticed/corrected them
yesterday. It looks like your tarball is about a week old -- you
might want to update to a newer one. Last night's tarball should
include all the fixes that we made yesterday; I'm artificially making
another one right now that includes some fixes from this morning.
Thanks for your patience; we're actually getting pretty close to
stable, but aren't quite there yet...
On Aug 30, 2005, at 6:01 AM, Joachim Worringen wrote:
Dear *,
I'm currently testing OpenMPI 1.0a1r7026 on a Linux 2.6.6 32-node
Dual-Athlon cluster with Myrinet (GM 2.1.1 on M3M-PCI64C boards).
gcc is 3.3.3. 4GB RAM per node.
Compilation from the snapshot and startup went fine, congratulations.
Surely not trivial.
Point-to-point tests (mpptest) pass. However, a rather simple
benchmark that tests the performance of collective operations (not
PMB, but a custom one) seems to deadlock. So far, I could figure out
the following (a simplified sketch of the measurement loop follows
the list):
- using btl 'gm' (default)
o 16 processes on 8 nodes: "deadlock" in Allreduce
o 2 processes on 2 nodes: "deadlock" in Reduce_scatter
- explicitly using btl 'tcp'
o 2 processes on 2 nodes: "deadlock" in Reduce_scatter
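The measurement loop is essentially this (a simplified sketch with
made-up message sizes and repeat counts, not the actual benchmark
code):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int count = 1 << 16;   /* made-up message size */
    double *sbuf, *rbuf, t0;
    int *rcounts;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sbuf = malloc(count * sizeof(double));
    rbuf = malloc(count * sizeof(double));
    rcounts = malloc(size * sizeof(int));
    for (i = 0; i < size; i++)
        rcounts[i] = count / size;
    for (i = 0; i < count; i++)
        sbuf[i] = (double)rank;

    /* hangs with 16 processes on 8 nodes over the gm btl */
    t0 = MPI_Wtime();
    for (i = 0; i < 100; i++)
        MPI_Allreduce(sbuf, rbuf, count, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
    if (rank == 0)
        printf("Allreduce:      %f s\n", MPI_Wtime() - t0);

    /* hangs with 2 processes on 2 nodes, over both gm and tcp */
    t0 = MPI_Wtime();
    for (i = 0; i < 100; i++)
        MPI_Reduce_scatter(sbuf, rbuf, rcounts, MPI_DOUBLE, MPI_SUM,
                           MPI_COMM_WORLD);
    if (rank == 0)
        printf("Reduce_scatter: %f s\n", MPI_Wtime() - t0);

    free(sbuf); free(rbuf); free(rcounts);
    MPI_Finalize();
    return 0;
}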
Additionally, I sporadically receive SEGVs when using gm:
Core was generated by `collmeas_open-mpi'.
Program terminated with signal 11, Segmentation fault.
(gdb) bt
#0 0x00000000 in ?? ()
#1 0x4006d04c in mca_mpool_base_registration_destructor () from
/home/joachim/local/open-mpi/lib/libmpi.so.0
#2 0x40179a0c in mca_mpool_gm_free () from
/home/joachim/local/open-mpi//lib/openmpi/mca_mpool_gm.so
#3 0x4006cf9c in mca_mpool_base_free () from
/home/joachim/local/open-mpi/lib/libmpi.so.0
#4 0x4004efbc in PMPI_Free_mem () from
/home/joachim/local/open-mpi/lib/libmpi.so.0
#5 0x0804b1c9 in main ()
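Frame #4 is the benchmark's buffer cleanup; the allocation pattern is
roughly along these lines (a trimmed-down sketch with a made-up size,
not the exact benchmark code):

#include <mpi.h>

int main(int argc, char **argv)
{
    double *buf;

    MPI_Init(&argc, &argv);

    /* with the gm btl, this memory comes from the GM mpool and is
       registered with the NIC */
    MPI_Alloc_mem(1024 * 1024, MPI_INFO_NULL, &buf);

    /* ... run the collective measurements on buf ... */

    /* the SEGV shows up here: the mpool registration destructor
       (frame #1) ends up calling through address 0x0 (frame #0) */
    MPI_Free_mem(buf);

    MPI_Finalize();
    return 0;
}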
Sometimes, this seems to happen when aborting an application (via
CTRL-C to mpirun):
Core was generated by `collmeas_open-mpi'.
Program terminated with signal 11, Segmentation fault.
(gdb) bt
#0 0x401d0633 in mca_btl_tcp_proc_remove () from
/home/joachim/local/open-mpi//lib/openmpi/mca_btl_tcp.so
Cannot access memory at address 0xbfffe2bc
Of course, I'm not sure if the deadlock really is a deadlock, but the
respective tests take way too much time. Needless to say, other MPI
implementations (MPICH-GM, our own MPI) run this benchmark, which we
have been using for some time on a variety of platforms, reliably on
the same machine.
Any ideas or comments? I will try to run PMB.
Joachim
--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/