On Tue, 13 Mar 2012 at 10:57pm, Gutierrez, Samuel K wrote:

> Fooey. What compiler are you using to build Open MPI and how are you configuring your build?

I'm using gcc as packaged by RH/CentOS 6.2:

[jlb@opt200 1.4.5-2]$ gcc --version
gcc (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)

I actually tried two custom builds of Open MPI 1.4.5. For the first, I tried to stick close to the options in RH's compat-openmpi SRPM:

./configure --prefix=$HOME/ompi-1.4.5 --enable-mpi-threads --enable-openib-ibcm \
    --with-sge --with-libltdl=external --with-valgrind --enable-memchecker \
    --with-psm=no --with-esmtp LDFLAGS='-Wl,-z,noexecstack'

That resulted in the backtrace I sent previously:
#0  0x00002b0099ec4c4c in mca_btl_sm_component_progress ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so
#1  0x00002b00967737ca in opal_progress ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0
#2  0x00002b00975ef8d5 in barrier ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so
#3  0x00002b009628da24 in ompi_mpi_init ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#4  0x00002b00962b24f0 in PMPI_Init ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#5  0x0000000000400826 in main (argc=1, argv=0x7fff9fe113f8)
    at mpihello-long.c:11

For kicks, I tried a second build of 1.4.5 with a bare minimum of options:

./configure --prefix=$HOME/ompi-1.4.5 --with-sge

That resulted in a slightly different backtrace that seems to be missing a bit:
#0  0x00002b7bbc8681d0 in ?? ()
#1  <signal handler called>
#2  0x00002b7bbd2b8f6c in mca_btl_sm_component_progress ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so
#3  0x00002b7bb9b2feda in opal_progress ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0
#4  0x00002b7bba9a98d5 in barrier ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so
#5  0x00002b7bb965d426 in ompi_mpi_init ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#6  0x00002b7bb967cba0 in PMPI_Init ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#7  0x0000000000400826 in main (argc=1, argv=0x7fff93634788)
    at mpihello-long.c:11
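
(For anyone trying to reproduce these traces: something along these lines against one of the affected ranks should dump the same stack -- <PID> is just a placeholder for the process in question:

gdb --batch -ex bt -p <PID>

or point gdb at the binary plus a core file if it's dumping core rather than hanging.)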

> Can you also run with a debug build of Open MPI so we can see the line numbers?

I'll do that first thing tomorrow.
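
My plan is to just add --enable-debug on top of the first configure line, something along these lines (the new prefix is only so it doesn't clobber the existing install, and I'm assuming --enable-debug is the right flag to get file/line info into the traces):

./configure --prefix=$HOME/ompi-1.4.5-debug --enable-debug --enable-mpi-threads --enable-openib-ibcm \
    --with-sge --with-libltdl=external --with-valgrind --enable-memchecker \
    --with-psm=no --with-esmtp LDFLAGS='-Wl,-z,noexecstack'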

> Another question.  How reproducible is this on your system?

In my testing today, it's been 100% reproducible.

> That's surprising.

Heh.  You're telling me.

Thanks for taking an interest in this.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
