I run a decent-sized (600+ nodes, 4000+ cores) heterogeneous (multiple
generations of x86_64 hardware) cluster. We use SGE (currently 6.1u4,
which, yes, is pretty ancient) and just upgraded from CentOS 5.7 to 6.2.
We had been using MPICH2 under CentOS 5, but I'd much rather use OpenMPI
as packaged by RH/CentOS. Our SGE queues are set up with a high-priority
queue running un-niced and a low-priority queue running at nice 19, each
with 1 slot per core on every node.
I'm seeing consistent segfaults with OpenMPI when I submit jobs without
specifying a queue (meaning some of a job's processes run niced and others
un-niced). This was initially reported to me by two users, each with their
own code, but I can reproduce it with my own very simple test program
(sketched below). The segfaults occur whether I'm using the distro's
default OpenMPI 1.5 or compat-openmpi-1.4.3. I'll note that I did upgrade
the distro RPM of openmpi-1.5.3 to 1.5.4 to get around the broken SGE
integration <https://bugzilla.redhat.com/show_bug.cgi?id=789150>. I can't
say with absolute certainty that jobs run entirely in the high-priority
queue never segfault, but if they do, it's nowhere near as reproducible.
The segfaults also don't seem to occur if a job runs entirely on one node.
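For concreteness, the test program is nothing fancier than a standard MPI
hello world along these lines (this is a sketch of that kind of program,
not a verbatim copy of mpihello-long.c, so line numbers won't match the
backtrace exactly):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);               /* the crash hits in here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

There's nothing in there beyond MPI_Init and friends, which matches where
the backtrace below lands.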
The error logs of failed jobs contain a stanza like this for each process
that segfaulted:
[opt207:03766] *** Process received signal ***
[opt207:03766] Signal: Segmentation fault (11)
[opt207:03766] Signal code: Address not mapped (1)
[opt207:03766] Failing at address: 0x2b4e279e778c
[opt207:03766] [ 0] /lib64/libpthread.so.0() [0x37f940f4a0]
[opt207:03766] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_sm.so(+0x42fc)
[0x2b17aa6002fc]
[opt207:03766] [ 2] /usr/lib64/openmpi/lib/libmpi.so.1(opal_progress+0x5a)
[0x37fa0d1aba]
[opt207:03766] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_grpcomm_bad.so(+0x24d5)
[0x2b17a7d234d5]
[opt207:03766] [ 4] /usr/lib64/openmpi/lib/libmpi.so.1() [0x37fa04bd57]
[opt207:03766] [ 5] /usr/lib64/openmpi/lib/libmpi.so.1(MPI_Init+0x170)
[0x37fa063c70]
[opt207:03766] [ 6] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.5-debug()
[0x4006e6]
[opt207:03766] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x37f901ecdd]
[opt207:03766] [ 8] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.5-debug()
[0x400609]
[opt207:03766] *** End of error message ***
A backtrace of the core file looks like this:
#0 sm_fifo_read () at btl_sm.h:353
#1 mca_btl_sm_component_progress () at btl_sm_component.c:588
#2 0x00000037fa0d1aba in opal_progress () at runtime/opal_progress.c:207
#3 0x00002b17a7d234d5 in barrier () at grpcomm_bad_module.c:277
#4 0x00000037fa04bd57 in ompi_mpi_init (argc=1, argv=0x7fff253658f8,
requested=<value optimized out>, provided=<value optimized out>)
at runtime/ompi_mpi_init.c:771
#5 0x00000037fa063c70 in PMPI_Init (argc=0x7fff253657fc,
argv=0x7fff253657f0)
at pinit.c:84
#6 0x00000000004006e6 in main (argc=1, argv=0x7fff253658f8)
at mpihello-long.c:11
Both of those are from a test run with 1.5. The 1.4 errors are essentially
identical, differing mainly in line numbers. I'm happy to post full logs,
but I'm trying (albeit unsuccessfully) to keep this from turning into a
novel. And I'll do as much debugging as I can -- I'm pretty motivated to
get this working.
Thanks for any insights.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF