I run a decent-sized (600+ nodes, 4000+ cores) heterogeneous (multiple
generations of x86_64 hardware) cluster. We use SGE (currently 6.1u4,
which, yes, is pretty ancient) and just upgraded from CentOS 5.7 to 6.2.
We had been using MPICH2 under CentOS 5, but I'd much rather use OpenMPI
as packaged by RH/CentOS. Our SGE queues are set up with a high-priority
queue running un-niced and a low-priority queue running at nice 19, each
with 1 slot per core on every node.
I'm seeing consistent segfaults with OpenMPI when I submit jobs without
specifying a queue (meaning some of a job's processes run niced and others
un-niced). This was initially reported to me by two users, each with their
own code, but I can reproduce it with my own very simple test program
(sketched below). The segfaults occur whether I'm using the distro's
default OpenMPI 1.5 or compat-openmpi-1.4.3. I'll note that I did upgrade
the distro RPM of openmpi-1.5.3 to 1.5.4 to get around the broken SGE
integration <https://bugzilla.redhat.com/show_bug.cgi?id=789150>. I can't
say with absolute certainty that jobs run entirely in the high-priority
queue never segfault, but if they do, it's nowhere near as reproducible.
The segfaults also don't seem to occur if a job runs entirely on one node.
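For concreteness, the test program is nothing fancier than a standard MPI
hello world along these lines (this is a sketch of that kind of program,
not a verbatim copy of mpihello-long.c, so line numbers won't match the
backtrace exactly):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);               /* the crash hits in here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

There's nothing in there beyond MPI_Init and friends, which matches where
the backtrace below lands.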
The error logs of failed jobs contain a stanza like this for each process
that segfaulted:
[opt207:03766] *** Process received signal ***
[opt207:03766] Signal: Segmentation fault (11)
[opt207:03766] Signal code: Address not mapped (1)
[opt207:03766] Failing at address: 0x2b4e279e778c
[opt207:03766] [ 0] /lib64/libpthread.so.0() [0x37f940f4a0]
[opt207:03766] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_sm.so(+0x42fc)
[0x2b17aa6002fc]
[opt207:03766] [ 2] /usr/lib64/openmpi/lib/libmpi.so.1(opal_progress+0x5a)
[0x37fa0d1aba]
[opt207:03766] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_grpcomm_bad.so(+0x24d5)
[0x2b17a7d234d5]
[opt207:03766] [ 4] /usr/lib64/openmpi/lib/libmpi.so.1() [0x37fa04bd57]
[opt207:03766] [ 5] /usr/lib64/openmpi/lib/libmpi.so.1(MPI_Init+0x170)
[0x37fa063c70]
[opt207:03766] [ 6] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.5-debug()
[0x4006e6]
[opt207:03766] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x37f901ecdd]
[opt207:03766] [ 8] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.5-debug()
[0x400609]
[opt207:03766] *** End of error message ***
A backtrace of the core file looks like this:
#0 sm_fifo_read () at btl_sm.h:353
#1 mca_btl_sm_component_progress () at btl_sm_component.c:588
#2 0x00000037fa0d1aba in opal_progress () at runtime/opal_progress.c:207
#3 0x00002b17a7d234d5 in barrier () at grpcomm_bad_module.c:277
#4 0x00000037fa04bd57 in ompi_mpi_init (argc=1, argv=0x7fff253658f8,
requested=<value optimized out>, provided=<value optimized out>)
at runtime/ompi_mpi_init.c:771
#5 0x00000037fa063c70 in PMPI_Init (argc=0x7fff253657fc,
argv=0x7fff253657f0)
at pinit.c:84
#6 0x00000000004006e6 in main (argc=1, argv=0x7fff253658f8)
at mpihello-long.c:11
Both of those are from a test run with 1.5. The 1.4 errors are essentially
identical, differing mainly in line numbers. I'm happy to post full logs,
but I'm trying (albeit unsuccessfully) to keep this from turning into a
novel. And I'll do as much debugging as I can -- I'm pretty motivated to
get this working.
Thanks for any insights.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF