Greetings!
We had some problems in our collectives with optimizations that were
added in the last month or so, and we just noticed/corrected them
yesterday. It looks like your tarball is about a week old -- you
might want to update to a newer one. Last night's tarball should
include all the fixes that we made yesterday; I'm artificially making
another one right now that includes some fixes from this morning.
Thanks for your patience; we're actually getting pretty close to
stable, but aren't quite there yet...
On Aug 30, 2005, at 6:01 AM, Joachim Worringen wrote:
Dear *,
I'm currently testing OpenMPI 1.0a1r7026 on a Linux 2.6.6 32-node
Dual-Athlon cluster with Myrinet (GM 2.1.1 on M3M-PCI64C boards).
gcc is 3.3.3. 4GB RAM per node.
Compilation from the snapshot and startup went fine, congratulations.
Surely not trivial.
Point-to-point tests (mpptest) pass. However, a rather simple
benchmark that tests the performance of collective operations (not
PMB, but a custom one) seems to deadlock. So far, I could figure out
the following (a simplified sketch of the measurement loop follows
the list):
- using btl 'gm' (default)
o 16 processes on 8 nodes: "deadlock" in Allreduce
o 2 processes on 2 nodes: "deadlock" in Reduce_scatter
- explicitly using btl 'tcp'
o 2 processes on 2 nodes: "deadlock" in Reduce_scatter
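The measurement loop is essentially this (a simplified sketch with
made-up message sizes and repeat counts, not the actual benchmark
code):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int count = 1 << 16;   /* made-up message size */
    double *sbuf, *rbuf, t0;
    int *rcounts;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sbuf = malloc(count * sizeof(double));
    rbuf = malloc(count * sizeof(double));
    rcounts = malloc(size * sizeof(int));
    for (i = 0; i < size; i++)
        rcounts[i] = count / size;
    for (i = 0; i < count; i++)
        sbuf[i] = (double)rank;

    /* hangs with 16 processes on 8 nodes over the gm btl */
    t0 = MPI_Wtime();
    for (i = 0; i < 100; i++)
        MPI_Allreduce(sbuf, rbuf, count, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
    if (rank == 0)
        printf("Allreduce:      %f s\n", MPI_Wtime() - t0);

    /* hangs with 2 processes on 2 nodes, over both gm and tcp */
    t0 = MPI_Wtime();
    for (i = 0; i < 100; i++)
        MPI_Reduce_scatter(sbuf, rbuf, rcounts, MPI_DOUBLE, MPI_SUM,
                           MPI_COMM_WORLD);
    if (rank == 0)
        printf("Reduce_scatter: %f s\n", MPI_Wtime() - t0);

    free(sbuf); free(rbuf); free(rcounts);
    MPI_Finalize();
    return 0;
}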
Additionally, I sporadically receive SEGVs when using gm:
Core was generated by `collmeas_open-mpi'.
Program terminated with signal 11, Segmentation fault.
(gdb) bt
#0 0x00000000 in ?? ()
#1 0x4006d04c in mca_mpool_base_registration_destructor () from
/home/joachim/local/open-mpi/lib/libmpi.so.0
#2 0x40179a0c in mca_mpool_gm_free () from
/home/joachim/local/open-mpi//lib/openmpi/mca_mpool_gm.so
#3 0x4006cf9c in mca_mpool_base_free () from
/home/joachim/local/open-mpi/lib/libmpi.so.0
#4 0x4004efbc in PMPI_Free_mem () from
/home/joachim/local/open-mpi/lib/libmpi.so.0
#5 0x0804b1c9 in main ()
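Frame #4 is the benchmark's buffer cleanup; the allocation pattern is
roughly along these lines (a trimmed-down sketch with a made-up size,
not the exact benchmark code):

#include <mpi.h>

int main(int argc, char **argv)
{
    double *buf;

    MPI_Init(&argc, &argv);

    /* with the gm btl, this memory comes from the GM mpool and is
       registered with the NIC */
    MPI_Alloc_mem(1024 * 1024, MPI_INFO_NULL, &buf);

    /* ... run the collective measurements on buf ... */

    /* the SEGV shows up here: the mpool registration destructor
       (frame #1) ends up calling through address 0x0 (frame #0) */
    MPI_Free_mem(buf);

    MPI_Finalize();
    return 0;
}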
Sometimes, this seems to happen when aborting an application (via
CTRL-C to mpirun):
Core was generated by `collmeas_open-mpi'.
Program terminated with signal 11, Segmentation fault.
(gdb) bt
#0 0x401d0633 in mca_btl_tcp_proc_remove () from
/home/joachim/local/open-mpi//lib/openmpi/mca_btl_tcp.so
Cannot access memory at address 0xbfffe2bc
Of course, I'm not sure if the deadlock really is a deadlock, but the
respective tests take way too much time. Needless to say, other MPI
implementations (MPICH-GM, our own MPI) run this benchmark, which we
have been using for some time on a variety of platforms, reliably on
the same machine.
Any ideas or comments? I will try to run PMB.
Joachim
--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/