On Mar 13, 2012, at 4:07 PM, Joshua Baker-LePain wrote:

> On Tue, 13 Mar 2012 at 9:15pm, Gutierrez, Samuel K wrote
>
>>>> Any more information surrounding your failures in 1.5.4 is greatly
>>>> appreciated.
>>>
>>> I'm happy to provide, but what exactly are you looking for? The test code
>>> I'm running is *very* simple:
>>
>> If you experience this type of failure with 1.4.5, can you send another
>> backtrace? We'll go from there.
Fooey. What compiler are you using to build Open MPI, and how are you
configuring your build? Can you also run with a debug build of Open MPI so we
can see the line numbers?

> In an odd way I'm relieved to say that 1.4.5 failed in the same way. From
> the SGE log of the run, here's the error message from one of the threads
> that segfaulted:
>
> [iq104:05697] *** Process received signal ***
> [iq104:05697] Signal: Segmentation fault (11)
> [iq104:05697] Signal code: Address not mapped (1)
> [iq104:05697] Failing at address: 0x2ad032188e8c
> [iq104:05697] [ 0] /lib64/libpthread.so.0() [0x3e5420f4a0]
> [iq104:05697] [ 1] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so(+0x3c4c) [0x2b0099ec4c4c]
> [iq104:05697] [ 2] /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0(opal_progress+0x6a) [0x2b00967737ca]
> [iq104:05697] [ 3] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so(+0x18d5) [0x2b00975ef8d5]
> [iq104:05697] [ 4] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(+0x38a24) [0x2b009628da24]
> [iq104:05697] [ 5] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(MPI_Init+0x1b0) [0x2b00962b24f0]
> [iq104:05697] [ 6] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug(main+0x22) [0x400826]
> [iq104:05697] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3e53e1ecdd]
> [iq104:05697] [ 8] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug() [0x400749]
> [iq104:05697] *** End of error message ***
>
> And the backtrace of the resulting core file:
>
> #0  0x00002b0099ec4c4c in mca_btl_sm_component_progress ()
>     from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so
> #1  0x00002b00967737ca in opal_progress ()
>     from /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0
> #2  0x00002b00975ef8d5 in barrier ()
>     from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so
> #3  0x00002b009628da24 in ompi_mpi_init ()
>     from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
> #4  0x00002b00962b24f0 in PMPI_Init ()
>     from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
> #5  0x0000000000400826 in main (argc=1, argv=0x7fff9fe113f8)
>     at mpihello-long.c:11
>
>> Another question. How reproducible is this on your system?
>
> In my testing today, it's been 100% reproducible.

That's surprising.

Thanks,
Sam

> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
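Since the thread never shows the reproducer itself, here is a minimal sketch
of the kind of MPI "hello world" test being described -- a hypothetical
stand-in, not the actual mpihello-long.c, which was not posted. Per the
backtraces above, the crash happens inside the MPI_Init() call itself, before
any user code runs:

    /* Hypothetical minimal reproducer; NOT the actual mpihello-long.c
     * from this thread. Build with: mpicc -g hello.c -o hello */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);   /* segfault in the backtrace occurs here */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

On the debug-build question: configuring Open MPI with --enable-debug builds
the libraries with debugging symbols, so frames like
mca_btl_sm_component_progress() in the backtrace resolve to source files and
line numbers instead of bare offsets.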