On Tue, 13 Mar 2012 at 9:15pm, Gutierrez, Samuel K wrote:

> Any more information surrounding your failures in 1.5.4 are greatly appreciated.

I'm happy to provide, but what exactly are you looking for? The test code I'm running is *very* simple:

> If you experience this type of failure with 1.4.5, can you send another backtrace? We'll go from there.

In an odd way I'm relieved to say that 1.4.5 failed in the same way. From the SGE log of the run, here's the error message from one of the processes that segfaulted:
[iq104:05697] *** Process received signal ***
[iq104:05697] Signal: Segmentation fault (11)
[iq104:05697] Signal code: Address not mapped (1)
[iq104:05697] Failing at address: 0x2ad032188e8c
[iq104:05697] [ 0] /lib64/libpthread.so.0() [0x3e5420f4a0]
[iq104:05697] [ 1] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so(+0x3c4c) [0x2b0099ec4c4c]
[iq104:05697] [ 2] /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0(opal_progress+0x6a) [0x2b00967737ca]
[iq104:05697] [ 3] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so(+0x18d5) [0x2b00975ef8d5]
[iq104:05697] [ 4] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(+0x38a24) [0x2b009628da24]
[iq104:05697] [ 5] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(MPI_Init+0x1b0) [0x2b00962b24f0]
[iq104:05697] [ 6] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug(main+0x22) [0x400826]
[iq104:05697] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3e53e1ecdd]
[iq104:05697] [ 8] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug() [0x400749]
[iq104:05697] *** End of error message ***

And the backtrace of the resulting core file:
#0  0x00002b0099ec4c4c in mca_btl_sm_component_progress ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so
#1  0x00002b00967737ca in opal_progress ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0
#2  0x00002b00975ef8d5 in barrier ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so
#3  0x00002b009628da24 in ompi_mpi_init ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#4  0x00002b00962b24f0 in PMPI_Init ()
   from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
#5  0x0000000000400826 in main (argc=1, argv=0x7fff9fe113f8)
    at mpihello-long.c:11
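
In case it helps when lining up the two listings: as far as I understand the glibc backtrace format, a bare `(+0x...)` offset in the signal handler output is relative to the library's load address, so subtracting it from the absolute address in brackets should give a page-aligned base. Below is a rough Python sketch of that arithmetic, using the addresses from frames [1], [3] and [4] above. The helper name `load_base` is just something made up for illustration, not anything from Open MPI.

```python
def load_base(absolute: int, offset: int) -> int:
    """Return the load address implied by an absolute frame address and its +offset."""
    return absolute - offset

# (library, absolute address, offset), copied from frames [1], [3] and [4].
# Assumption: a bare "(+0x...)" offset is measured from the library's load base.
frames = [
    ("mca_btl_sm.so", 0x2B0099EC4C4C, 0x3C4C),
    ("mca_grpcomm_bad.so", 0x2B00975EF8D5, 0x18D5),
    ("libmpi.so.0", 0x2B009628DA24, 0x38A24),
]

bases = {name: load_base(addr, off) for name, addr, off in frames}

for name, base in bases.items():
    # A plausible load base should sit on a 4 KiB page boundary.
    print(f"{name}: base {base:#x}, page aligned: {base % 0x1000 == 0}")
```

All three bases come out page aligned, which is at least consistent with the core file and the log describing the same crash in the same process.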

> Another question.  How reproducible is this on your system?

In my testing today, it's been 100% reproducible.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
