On Tue, 13 Mar 2012 at 9:15pm, Gutierrez, Samuel K wrote:

>>> Any more information surrounding your failures in 1.5.4 is greatly
>>> appreciated.
>>
>> I'm happy to provide, but what exactly are you looking for? The test
>> code I'm running is *very* simple:
>
> If you experience this type of failure with 1.4.5, can you send another
> backtrace? We'll go from there.

In an odd way I'm relieved to say that 1.4.5 failed in the same way. From
the SGE log of the run, here's the error message from one of the threads
that segfaulted:
[iq104:05697] *** Process received signal ***
[iq104:05697] Signal: Segmentation fault (11)
[iq104:05697] Signal code: Address not mapped (1)
[iq104:05697] Failing at address: 0x2ad032188e8c
[iq104:05697] [ 0] /lib64/libpthread.so.0() [0x3e5420f4a0]
[iq104:05697] [ 1] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so(+0x3c4c) [0x2b0099ec4c4c]
[iq104:05697] [ 2] /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0(opal_progress+0x6a) [0x2b00967737ca]
[iq104:05697] [ 3] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so(+0x18d5) [0x2b00975ef8d5]
[iq104:05697] [ 4] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(+0x38a24) [0x2b009628da24]
[iq104:05697] [ 5] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(MPI_Init+0x1b0) [0x2b00962b24f0]
[iq104:05697] [ 6] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug(main+0x22) [0x400826]
[iq104:05697] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3e53e1ecdd]
[iq104:05697] [ 8] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug() [0x400749]
[iq104:05697] *** End of error message ***
And the backtrace of the resulting core file:
#0  0x00002b0099ec4c4c in mca_btl_sm_component_progress ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so
#1  0x00002b00967737ca in opal_progress ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0
#2  0x00002b00975ef8d5 in barrier ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so
#3  0x00002b009628da24 in ompi_mpi_init ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#4  0x00002b00962b24f0 in PMPI_Init ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#5  0x0000000000400826 in main (argc=1, argv=0x7fff9fe113f8)
    at mpihello-long.c:11
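
As a sanity check that the core file belongs to the same crash as the SGE log, the frame addresses in the two listings above can be compared. The following is only a rough Python sketch: the addresses are copied by hand from the listings (with paths shortened to file names), and the helper names `signal_addresses` and `gdb_frames` are made up for illustration. Frame [ 0] of the signal output (in libpthread) and frames [ 7] and [ 8] are left out because they have no counterpart in the gdb backtrace.

```python
import re

# Frames [ 1] to [ 6] from the signal handler output (paths shortened).
signal_trace = """\
[ 1] mca_btl_sm.so(+0x3c4c) [0x2b0099ec4c4c]
[ 2] libopen-pal.so.0(opal_progress+0x6a) [0x2b00967737ca]
[ 3] mca_grpcomm_bad.so(+0x18d5) [0x2b00975ef8d5]
[ 4] libmpi.so.0(+0x38a24) [0x2b009628da24]
[ 5] libmpi.so.0(MPI_Init+0x1b0) [0x2b00962b24f0]
[ 6] mpihello-long.ompi-1.4-debug(main+0x22) [0x400826]
"""

# Frames #0 to #5 from the gdb backtrace of the core file.
gdb_trace = """\
#0 0x00002b0099ec4c4c in mca_btl_sm_component_progress ()
#1 0x00002b00967737ca in opal_progress ()
#2 0x00002b00975ef8d5 in barrier ()
#3 0x00002b009628da24 in ompi_mpi_init ()
#4 0x00002b00962b24f0 in PMPI_Init ()
#5 0x0000000000400826 in main (argc=1, argv=0x7fff9fe113f8)
"""


def signal_addresses(text):
    """Return the bracketed return address that ends each signal-trace line."""
    return [int(a, 16) for a in re.findall(r"\[(0x[0-9a-f]+)\]$", text, re.M)]


def gdb_frames(text):
    """Return (address, function name) for each gdb frame line."""
    pairs = re.findall(r"^#\d+\s+(0x[0-9a-f]+) in (\w+)", text, re.M)
    return [(int(a, 16), name) for a, name in pairs]


sig = signal_addresses(signal_trace)
frames = gdb_frames(gdb_trace)

# Pair the frames up in order and report whether the addresses agree.
for addr, (gdb_addr, name) in zip(sig, frames):
    status = "match" if addr == gdb_addr else "MISMATCH"
    print(f"{addr:#x}  {name:<32} {status}")
```

With the values above, all six frames report a match. The gdb symbols also put names on the anonymous offsets in the signal output: for example, `+0x3c4c` in mca_btl_sm.so is inside `mca_btl_sm_component_progress`.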

> Another question. How reproducible is this on your system?

In my testing today, it's been 100% reproducible.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF