I run a decent-sized (600+ nodes, 4000+ cores), heterogeneous (multiple generations of x86_64 hardware) cluster. We use SGE (currently 6.1u4, which, yes, is pretty ancient) and just upgraded from CentOS 5.7 to 6.2. We had been using MPICH2 under CentOS 5, but I'd much rather use OpenMPI as packaged by RH/CentOS. Our SGE queues are set up with a high priority queue running un-niced and a low priority queue running at nice 19, each with 1 slot per core on every node.
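In SGE terms, the niced-ness comes from the queue "priority" attribute; trimmed "qconf -sq" output would look something like this ("short.q" and "long.q" are stand-ins here, not necessarily our actual queue names):

qname                 short.q
priority              0
...

qname                 long.q
priority              19
...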

I'm seeing consistent segfaults with OpenMPI when I submit jobs without specifying a queue (meaning some of a job's processes run niced and others un-niced). This was initially reported to me by two users, each with their own code, but I can reproduce it with my own very simple test program (sketched below). The segfaults occur whether I'm using the default OpenMPI version of 1.5 or compat-openmpi-1.4.3. I'll note that I did upgrade the distro RPM of openmpi-1.5.3 to 1.5.4 to get around the broken SGE integration <https://bugzilla.redhat.com/show_bug.cgi?id=789150>. I can't say for certain that jobs running entirely in the high priority queue never segfault, but if they do, it's not nearly as reproducible. The segfaults also don't seem to happen if a job runs entirely on one node.
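For reference, the test program is just a minimal MPI hello world, roughly the following (a from-memory sketch, not mpihello-long.c verbatim; the sleep at the end is only there so the ranks stick around for a while):

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;
    char host[256];

    MPI_Init(&argc, &argv);    /* the backtrace below says we die in here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    gethostname(host, sizeof(host));
    printf("Hello from rank %d of %d on %s\n", rank, size, host);
    sleep(60);                 /* the "-long" part: hang around for a bit */
    MPI_Finalize();
    return 0;
}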

The error logs of failed jobs contain a stanza like this for each process that segfaulted:
[opt207:03766] *** Process received signal ***
[opt207:03766] Signal: Segmentation fault (11)
[opt207:03766] Signal code: Address not mapped (1)
[opt207:03766] Failing at address: 0x2b4e279e778c
[opt207:03766] [ 0] /lib64/libpthread.so.0() [0x37f940f4a0]
[opt207:03766] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_sm.so(+0x42fc) [0x2b17aa6002fc]
[opt207:03766] [ 2] /usr/lib64/openmpi/lib/libmpi.so.1(opal_progress+0x5a) [0x37fa0d1aba]
[opt207:03766] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_grpcomm_bad.so(+0x24d5) [0x2b17a7d234d5]
[opt207:03766] [ 4] /usr/lib64/openmpi/lib/libmpi.so.1() [0x37fa04bd57]
[opt207:03766] [ 5] /usr/lib64/openmpi/lib/libmpi.so.1(MPI_Init+0x170) [0x37fa063c70]
[opt207:03766] [ 6] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.5-debug() [0x4006e6]
[opt207:03766] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x37f901ecdd]
[opt207:03766] [ 8] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.5-debug() [0x400609]
[opt207:03766] *** End of error message ***

A backtrace of the core file looks like this:
#0  sm_fifo_read () at btl_sm.h:353
#1  mca_btl_sm_component_progress () at btl_sm_component.c:588
#2  0x00000037fa0d1aba in opal_progress () at runtime/opal_progress.c:207
#3  0x00002b17a7d234d5 in barrier () at grpcomm_bad_module.c:277
#4  0x00000037fa04bd57 in ompi_mpi_init (argc=1, argv=0x7fff253658f8,
    requested=<value optimized out>, provided=<value optimized out>)
    at runtime/ompi_mpi_init.c:771
#5  0x00000037fa063c70 in PMPI_Init (argc=0x7fff253657fc, argv=0x7fff253657f0)
    at pinit.c:84
#6  0x00000000004006e6 in main (argc=1, argv=0x7fff253658f8)
    at mpihello-long.c:11

Those are both from a test with 1.5; the 1.4 errors are essentially identical, with the differences mainly in line numbers. I'm happy to post full logs, but I'm trying (albeit unsuccessfully) to keep this from turning into a novel. And I'll gladly do as much debugging as I can -- I'm pretty motivated to get this working.

Thanks for any insights.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
