Hi,

Just to be clear, what specific version of Open MPI produced the provided
backtrace? This smells like a missing memory barrier problem.
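To illustrate what I mean by that: shared-memory FIFOs like the one in the sm
BTL depend on a single-producer/single-consumer publish pattern, and if the
write barrier on the producer side (or the matching read barrier on the
consumer side) is missing, the reader can observe the published value before
the data behind it and end up chasing garbage -- which would look a lot like
the "Address not mapped" fault in sm_fifo_read in the backtrace below. Here is
a minimal, self-contained sketch of that pattern. It is NOT Open MPI's btl_sm
code; the names (shm_mailbox_t, msg_t, mailbox_post, mailbox_poll) and the C11
fences are just for illustration, and Open MPI uses its own
opal_atomic_wmb()/opal_atomic_rmb() wrappers instead.

/*
 * Minimal sketch (NOT Open MPI source) of the single-producer /
 * single-consumer publish pattern a shared-memory FIFO relies on,
 * and where the barriers have to sit.  All names here are invented
 * for illustration.
 */
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    int    tag;
    size_t len;
    char   data[256];
} msg_t;

typedef struct {
    msg_t            slot;    /* payload the producer fills in       */
    _Atomic(msg_t *) ready;   /* NULL until the payload is published */
} shm_mailbox_t;

/* Producer: fill the slot, then publish a pointer to it. */
void mailbox_post(shm_mailbox_t *mb, const msg_t *m)
{
    mb->slot = *m;
    /*
     * Write barrier.  Without it, the compiler (and, on weakly ordered
     * CPUs, the hardware) may make 'ready' visible before the slot
     * contents; in a real FIFO where the published value is itself a
     * pointer into shared memory, the reader can then chase a stale or
     * bogus address -- the kind of thing that surfaces as
     * "Address not mapped".
     */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&mb->ready, &mb->slot, memory_order_relaxed);
}

/* Consumer: poll for the pointer, then read the payload behind it. */
const msg_t *mailbox_poll(shm_mailbox_t *mb)
{
    msg_t *m = atomic_load_explicit(&mb->ready, memory_order_relaxed);
    if (m == NULL) {
        return NULL;          /* nothing published yet */
    }
    /* Matching read barrier before touching *m. */
    atomic_thread_fence(memory_order_acquire);
    return m;
}

On x86_64 the CPU itself won't reorder the two producer-side stores, but the
compiler is still free to unless a barrier is in place, and whether the race
window actually gets hit is very timing dependent -- which could be consistent
with it only showing up when niced and un-niced processes are mixed. Just
speculation at this point, of course, hence my question about the exact
version.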
--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Mar 13, 2012, at 1:07 PM, Joshua Baker-LePain wrote:

> I run a decent size (600+ nodes, 4000+ cores) heterogeneous (multiple
> generations of x86_64 hardware) cluster. We use SGE (currently 6.1u4, which,
> yes, is pretty ancient) and just upgraded from CentOS 5.7 to 6.2. We had been
> using MPICH2 under CentOS 5, but I'd much rather use OpenMPI as packaged by
> RH/CentOS. Our SGE queues are setup with a high priority queue, running
> un-niced, and a low priority queue running at nice 19, each with 1 slot per
> core on every node.
>
> I'm seeing consistent segfaults with OpenMPI when I submit jobs without
> specifying a queue (meaning some threads run niced, others run un-niced).
> This was initially reported to me by 2 users, each with their own code, but I
> can reproduce it with my own very simple test program. The segfaults occur
> whether I'm using the default OpenMPI version of 1.5 or compat-openmpi-1.4.3.
> I'll note that I did upgrade the distro RPM of openmpi-1.5.3 to 1.5.4 to get
> around the broken SGE integration
> <https://bugzilla.redhat.com/show_bug.cgi?id=789150>. I can't absolutely say
> that jobs run entirely in the high priority queue do not segfault. But, if
> they do, it's not nearly as reproducible. The segfaults also don't seem to
> occur if a job runs entirely on one node.
>
> The error logs of failed jobs contain a stanza like this for each thread
> which segfaulted:
>
> [opt207:03766] *** Process received signal ***
> [opt207:03766] Signal: Segmentation fault (11)
> [opt207:03766] Signal code: Address not mapped (1)
> [opt207:03766] Failing at address: 0x2b4e279e778c
> [opt207:03766] [ 0] /lib64/libpthread.so.0() [0x37f940f4a0]
> [opt207:03766] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_sm.so(+0x42fc)
> [0x2b17aa6002fc]
> [opt207:03766] [ 2] /usr/lib64/openmpi/lib/libmpi.so.1(opal_progress+0x5a)
> [0x37fa0d1aba]
> [opt207:03766] [ 3]
> /usr/lib64/openmpi/lib/openmpi/mca_grpcomm_bad.so(+0x24d5) [0x2b17a7d234d5]
> [opt207:03766] [ 4] /usr/lib64/openmpi/lib/libmpi.so.1() [0x37fa04bd57]
> [opt207:03766] [ 5] /usr/lib64/openmpi/lib/libmpi.so.1(MPI_Init+0x170)
> [0x37fa063c70]
> [opt207:03766] [ 6] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.5-debug()
> [0x4006e6]
> [opt207:03766] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x37f901ecdd]
> [opt207:03766] [ 8] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.5-debug()
> [0x400609]
> [opt207:03766] *** End of error message ***
>
> A backtrace of the core file looks like this:
>
> #0 sm_fifo_read () at btl_sm.h:353
> #1 mca_btl_sm_component_progress () at btl_sm_component.c:588
> #2 0x00000037fa0d1aba in opal_progress () at runtime/opal_progress.c:207
> #3 0x00002b17a7d234d5 in barrier () at grpcomm_bad_module.c:277
> #4 0x00000037fa04bd57 in ompi_mpi_init (argc=1, argv=0x7fff253658f8,
>    requested=<value optimized out>, provided=<value optimized out>)
>    at runtime/ompi_mpi_init.c:771
> #5 0x00000037fa063c70 in PMPI_Init (argc=0x7fff253657fc, argv=0x7fff253657f0)
>    at pinit.c:84
> #6 0x00000000004006e6 in main (argc=1, argv=0x7fff253658f8)
>    at mpihello-long.c:11
>
> Those are both from a test with 1.5. The 1.4 errors are essentially
> identical, with the differences mainly in line numbers. I'm happy to post
> full logs, but I'm trying (albeit unsuccessfully) to keep this from turning
> into a novel. I'm happy to do as much debugging as I can -- I'm pretty
> motivated to get this working.
>
> Thanks for any insights.
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF