On Mar 13, 2012, at 4:07 PM, Joshua Baker-LePain wrote:

> On Tue, 13 Mar 2012 at 9:15pm, Gutierrez, Samuel K wrote
> 
>>>> Any more information about your failures in 1.5.4 would be greatly 
>>>> appreciated.
>>> 
>>> I'm happy to provide it, but what exactly are you looking for?  The test 
>>> code I'm running is *very* simple:
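(The test code itself didn't survive the quoting, so for anyone following 
along, here's a minimal MPI test of the same shape.  This is just my sketch, 
not necessarily the actual mpihello-long.c:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        /* the reported segfault happens inside this call */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

The backtrace below puts the crash inside MPI_Init, so the body of the 
program barely matters.)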
>> 
>> If you experience this type of failure with 1.4.5, can you send another 
>> backtrace?  We'll go from there.
> 

Fooey.  What compiler are you using to build Open MPI, and how are you 
configuring your build?  Can you also run with a debug build of Open MPI so we 
can see the line numbers?
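
For example, something along these lines (the prefix and compiler are just 
placeholders for whatever matches your setup):

    ./configure CC=gcc --prefix=$HOME/ompi-1.4.5-debug --enable-debug
    make all install

--enable-debug is what gets us file and line numbers inside the Open MPI 
libraries themselves.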

> In an odd way, I'm relieved to say that 1.4.5 failed in the same way.  From 
> the SGE log of the run, here's the error message from one of the processes 
> that segfaulted:
> [iq104:05697] *** Process received signal ***
> [iq104:05697] Signal: Segmentation fault (11)
> [iq104:05697] Signal code: Address not mapped (1)
> [iq104:05697] Failing at address: 0x2ad032188e8c
> [iq104:05697] [ 0] /lib64/libpthread.so.0() [0x3e5420f4a0]
> [iq104:05697] [ 1] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so(+0x3c4c) [0x2b0099ec4c4c]
> [iq104:05697] [ 2] /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0(opal_progress+0x6a) [0x2b00967737ca]
> [iq104:05697] [ 3] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so(+0x18d5) [0x2b00975ef8d5]
> [iq104:05697] [ 4] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(+0x38a24) [0x2b009628da24]
> [iq104:05697] [ 5] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(MPI_Init+0x1b0) [0x2b00962b24f0]
> [iq104:05697] [ 6] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug(main+0x22) [0x400826]
> [iq104:05697] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3e53e1ecdd]
> [iq104:05697] [ 8] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug() [0x400749]
> [iq104:05697] *** End of error message ***
> 
> And the backtrace of the resulting core file:
> #0  0x00002b0099ec4c4c in mca_btl_sm_component_progress ()
>   from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so
> #1  0x00002b00967737ca in opal_progress ()
>   from /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0
> #2  0x00002b00975ef8d5 in barrier ()
>   from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so
> #3  0x00002b009628da24 in ompi_mpi_init ()
>   from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
> #4  0x00002b00962b24f0 in PMPI_Init ()
>   from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
> #5  0x0000000000400826 in main (argc=1, argv=0x7fff9fe113f8)
>    at mpihello-long.c:11
> 
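With a debug build of Open MPI, the frames in mca_btl_sm.so and friends 
should resolve to file:line the same way main() does above.  Getting the 
trace is unchanged, e.g. (the core file name depends on your system's 
core_pattern):

    gdb /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug core
    (gdb) bt
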
>> Another question: how reproducible is this on your system?
> 
> In my testing today, it's been 100% reproducible.

That's surprising.

Thanks,

Sam

> 
> -- 
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF