Greetings!

We actually had some problems in our collectives with optimizations that were added in the last month or so, and we just noticed and corrected them yesterday. It looks like your tarball is about a week old -- you might want to update to a newer one. Last night's tarball should include all the fixes that we made yesterday; I'm artificially making another one right now that includes some fixes from this morning.

Thanks for your patience; we're actually getting pretty close to stable, but aren't quite there yet...


On Aug 30, 2005, at 6:01 AM, Joachim Worringen wrote:


Dear *,

I'm currently testing Open MPI 1.0a1r7026 on a Linux 2.6.6 32-node dual-Athlon cluster with Myrinet (GM 2.1.1 on M3M-PCI64C boards). The compiler is gcc 3.3.3, and each node has 4 GB of RAM.

Compilation from the snapshot and startup went fine -- congratulations, that is surely not trivial.

Point-to-point tests (mpptest) pass. However, running a rather simple benchmark that tests the performance of collective operations (not PMB, but a custom one) seems to deadlock; a minimal sketch of the kind of collective loop involved follows the list below. So far, I have been able to figure out:
- using btl 'gm' (default)
   o 16 processes on 8 nodes: "deadlock" in Allreduce
   o 2 processes on 2 nodes: "deadlock" in Reduce_scatter
- explicitly using btl 'tcp'
   o 2 processes on 2 nodes: "deadlock" in Reduce_scatter
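
The benchmark itself is a custom one, so the following is only a minimal
sketch of a collective loop in the same spirit -- the buffer size, iteration
count, and reduction operation are illustrative assumptions, not the
benchmark's actual parameters:

/* Hypothetical reproducer sketch -- NOT the actual benchmark. It simply
 * loops over the two collectives that appear to hang; count, iterations,
 * and MPI_SUM are arbitrary choices for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int count = 1024;                 /* elements per process -- arbitrary */
    double *sendbuf, *recvbuf;
    int *recvcounts;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf    = malloc(count * size * sizeof(double));
    recvbuf    = malloc(count * size * sizeof(double));
    recvcounts = malloc(size * sizeof(int));
    for (i = 0; i < count * size; i++) sendbuf[i] = (double)rank;
    for (i = 0; i < size; i++)         recvcounts[i] = count;

    for (i = 0; i < 100; i++) {
        /* Both of these collectives appear to "deadlock". */
        MPI_Allreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, MPI_DOUBLE, MPI_SUM,
                           MPI_COMM_WORLD);
    }

    if (rank == 0) printf("collective loop completed\n");
    free(sendbuf); free(recvbuf); free(recvcounts);
    MPI_Finalize();
    return 0;
}

(For the 'tcp' case the transport is forced via the btl MCA parameter, i.e.
something along the lines of "mpirun -np 2 --mca btl tcp,self ./collloop";
the exact parameter spelling for this snapshot is an assumption.)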

Additionally, I sporadically receive SEGVs when using gm:
Core was generated by `collmeas_open-mpi'.
Program terminated with signal 11, Segmentation fault.
(gdb) bt
#0  0x00000000 in ?? ()
#1  0x4006d04c in mca_mpool_base_registration_destructor () from /home/joachim/local/open-mpi/lib/libmpi.so.0
#2  0x40179a0c in mca_mpool_gm_free () from /home/joachim/local/open-mpi//lib/openmpi/mca_mpool_gm.so
#3  0x4006cf9c in mca_mpool_base_free () from /home/joachim/local/open-mpi/lib/libmpi.so.0
#4  0x4004efbc in PMPI_Free_mem () from /home/joachim/local/open-mpi/lib/libmpi.so.0
#5  0x0804b1c9 in main ()
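
The frames above show the crash happening on the MPI_Free_mem path: memory
obtained via MPI_Alloc_mem is registered with the GM mpool, and freeing it
runs the registration destructor (frames #1-#4). A minimal sketch of that
allocation/free pattern -- the buffer size is arbitrary and purely for
illustration -- would be:

/* Sketch of the call pattern implied by the backtrace: MPI_Alloc_mem hands
 * out memory registered with the active mpool (gm here), and MPI_Free_mem
 * tears that registration down again. Buffer size is arbitrary. */
#include <mpi.h>

void alloc_free_example(void)
{
    void *buf;

    MPI_Alloc_mem((MPI_Aint)(1 << 20), MPI_INFO_NULL, &buf);  /* 1 MiB */

    /* ... use buf as a communication buffer ... */

    MPI_Free_mem(buf);   /* frame #4: PMPI_Free_mem -> mca_mpool_gm_free */
}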

Sometimes, this seems to happen when aborting an application (via CTRL-C to mpirun):
Core was generated by `collmeas_open-mpi'.
Program terminated with signal 11, Segmentation fault.
(gdb) bt
#0  0x401d0633 in mca_btl_tcp_proc_remove () from /home/joachim/local/open-mpi//lib/openmpi/mca_btl_tcp.so
Cannot access memory at address 0xbfffe2bc

Of course, I'm not sure whether the deadlock really is a deadlock, but the respective tests take way too much time. Needless to say, other MPI implementations (MPICH-GM, our own MPI) run this benchmark, which we have been using for some time on a variety of platforms, reliably on the same machine.

Any ideas or comments? I will try to run PMB.

  Joachim

--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de



--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
