Hi,

Just to be clear, which specific version of Open MPI produced the backtrace 
below?  This smells like a missing memory barrier problem.
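
To illustrate what I mean: the shared-memory BTL pulls fragments off a polled 
FIFO, and that only works if the producer publishes the payload before the 
pointer that advertises it, and the consumer orders its reads the same way. 
Here's a stripped-down, made-up sketch (none of this is actual Open MPI code, 
and all of the names are invented) of where the barriers have to sit:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { int payload; } frag_t;

static frag_t *volatile slot = NULL;   /* one-entry stand-in for the FIFO */

static void *producer(void *arg)
{
    (void)arg;
    frag_t *f = malloc(sizeof *f);
    f->payload = 42;                   /* fill in the fragment first */
    __sync_synchronize();              /* write barrier: the payload must be
                                          visible before the pointer that
                                          advertises it */
    slot = f;
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    frag_t *f;
    while ((f = slot) == NULL)         /* poll, roughly like sm_fifo_read() */
        ;
    __sync_synchronize();              /* read barrier: without it the CPU may
                                          reorder the loads and hand back a
                                          stale or inconsistent fragment */
    printf("payload = %d\n", f->payload);
    free(f);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

Drop either barrier and, depending on the CPU's memory model and what the 
compiler does, the consumer can end up chasing a pointer into memory that 
isn't what it expects -- the sort of thing that can show up as an "Address 
not mapped" crash like the one in your backtrace.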

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Mar 13, 2012, at 1:07 PM, Joshua Baker-LePain wrote:

> I run a decent-sized (600+ nodes, 4000+ cores) heterogeneous (multiple 
> generations of x86_64 hardware) cluster.  We use SGE (currently 6.1u4, which, 
> yes, is pretty ancient) and just upgraded from CentOS 5.7 to 6.2.  We had 
> been using MPICH2 under CentOS 5, but I'd much rather use Open MPI as 
> packaged by RH/CentOS.  Our SGE queues are set up with a high-priority queue 
> running un-niced and a low-priority queue running at nice 19, each with 1 
> slot per core on every node.
> 
> I'm seeing consistent segfaults with Open MPI when I submit jobs without 
> specifying a queue (meaning some processes run niced while others run 
> un-niced).  This was initially reported to me by two users, each with their 
> own code, but I can reproduce it with my own very simple test program.  The 
> segfaults occur whether I'm using the distro's default Open MPI 1.5 or 
> compat-openmpi-1.4.3.  I'll note that I did upgrade the distro RPM of 
> openmpi-1.5.3 to 1.5.4 to get around the broken SGE integration 
> <https://bugzilla.redhat.com/show_bug.cgi?id=789150>.  I can't say for 
> certain that jobs run entirely in the high-priority queue never segfault, 
> but if they do, it's far less reproducible.  The segfaults also don't seem 
> to occur if a job runs entirely on one node.
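> 
> For reference, the test program is essentially a bare-bones MPI "hello 
> world".  A sketch of it (from memory, not the exact source, so the line 
> numbers won't match the backtrace below):
> 
> #include <stdio.h>
> #include <mpi.h>
> 
> int main(int argc, char **argv)
> {
>     int rank, size;
> 
>     MPI_Init(&argc, &argv);               /* the segfault fires in here */
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     printf("Hello from rank %d of %d\n", rank, size);
>     MPI_Finalize();
>     return 0;
> }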
> 
> The error logs of failed jobs contain a stanza like this for each process 
> that segfaulted:
> [opt207:03766] *** Process received signal ***
> [opt207:03766] Signal: Segmentation fault (11)
> [opt207:03766] Signal code: Address not mapped (1)
> [opt207:03766] Failing at address: 0x2b4e279e778c
> [opt207:03766] [ 0] /lib64/libpthread.so.0() [0x37f940f4a0]
> [opt207:03766] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_sm.so(+0x42fc) [0x2b17aa6002fc]
> [opt207:03766] [ 2] /usr/lib64/openmpi/lib/libmpi.so.1(opal_progress+0x5a) [0x37fa0d1aba]
> [opt207:03766] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_grpcomm_bad.so(+0x24d5) [0x2b17a7d234d5]
> [opt207:03766] [ 4] /usr/lib64/openmpi/lib/libmpi.so.1() [0x37fa04bd57]
> [opt207:03766] [ 5] /usr/lib64/openmpi/lib/libmpi.so.1(MPI_Init+0x170) [0x37fa063c70]
> [opt207:03766] [ 6] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.5-debug() [0x4006e6]
> [opt207:03766] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x37f901ecdd]
> [opt207:03766] [ 8] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.5-debug() [0x400609]
> [opt207:03766] *** End of error message ***
> 
> A backtrace of the core file looks like this:
> #0  sm_fifo_read () at btl_sm.h:353
> #1  mca_btl_sm_component_progress () at btl_sm_component.c:588
> #2  0x00000037fa0d1aba in opal_progress () at runtime/opal_progress.c:207
> #3  0x00002b17a7d234d5 in barrier () at grpcomm_bad_module.c:277
> #4  0x00000037fa04bd57 in ompi_mpi_init (argc=1, argv=0x7fff253658f8,
>    requested=<value optimized out>, provided=<value optimized out>)
>    at runtime/ompi_mpi_init.c:771
> #5  0x00000037fa063c70 in PMPI_Init (argc=0x7fff253657fc, argv=0x7fff253657f0)
>    at pinit.c:84
> #6  0x00000000004006e6 in main (argc=1, argv=0x7fff253658f8)
>    at mpihello-long.c:11
> 
> Those are both from a test with 1.5.  The 1.4 errors are essentially 
> identical, with the differences mainly in line numbers.  I'm happy to post 
> full logs, but I'm trying (albeit unsuccessfully) to keep this from turning 
> into a novel.  I'm also willing to do as much debugging as I can -- I'm 
> pretty motivated to get this working.
> 
> Thanks for any insights.
> 
> -- 
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

