Sounds odd - can you configure OMPI with --enable-debug and run it again? If it fails and you can get a core dump, could you tell us the line number where it is failing?
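Something like the following should get you there (the --prefix path is just a placeholder for wherever you put the debug build, and the core file may be named core.<pid> depending on your kernel settings):

    ./configure --enable-debug --prefix=/path/to/openmpi-1.6.5-debug
    make install

    # allow core dumps before re-running the failing case
    ulimit -c unlimited
    mpirun -np 2 --mca btl openib,self ring_c

    # on a segfault, pull the file/line out of the backtrace
    gdb ring_c core
    (gdb) bt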
On Jun 3, 2014, at 9:58 AM, Fischer, Greg A. <fisch...@westinghouse.com> wrote:

> Apologies - I forgot to add some of the information requested by the FAQ:
>
> 1. OpenFabrics is provided by the Linux distribution:
>
> [binf102:fischega] $ rpm -qa | grep ofed
> ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
> ofed-1.5.4.1-0.11.5
> ofed-doc-1.5.4.1-0.11.5
>
> 2. Linux distro / kernel:
>
> [binf102:fischega] $ cat /etc/SuSE-release
> SUSE Linux Enterprise Server 11 (x86_64)
> VERSION = 11
> PATCHLEVEL = 3
>
> [binf102:fischega] $ uname -a
> Linux casl102 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) x86_64 x86_64 x86_64 GNU/Linux
>
> 3. Not sure which subnet manager is being used - I think OpenSM, but I'll need to check with my administrators.
>
> 4. Output of ibv_devinfo is attached.
>
> 5. ifconfig output is attached.
>
> 6. ulimit -l output:
>
> [binf102:fischega] $ ulimit -l
> unlimited
>
> Greg
>
> From: Fischer, Greg A.
> Sent: Tuesday, June 03, 2014 12:38 PM
> To: Open MPI Users
> Cc: Fischer, Greg A.
> Subject: intermittent segfaults with openib on ring_c.c
>
> Hello openmpi-users,
>
> I'm running into a perplexing problem on a new system: intermittent segmentation faults when I run the ring_c.c example with the openib BTL. See an example below. Approximately 50% of the time it produces the expected output, but the other 50% of the time it segfaults. LD_LIBRARY_PATH is set correctly, and the correct "mpirun" is being invoked. The output of "ompi_info --all" is attached.
>
> One potential problem may be that the system OpenMPI was compiled on is mostly the same as the system where it is being executed, but some of the installed packages differ. I've checked the critical ones (libibverbs, librdmacm, libmlx4-rdmav2, etc.), and they appear to be the same.
>
> Can anyone suggest how I might start tracking this problem down?
> Thanks,
> Greg
>
> [binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
> [binf102:31268] *** Process received signal ***
> [binf102:31268] Signal: Segmentation fault (11)
> [binf102:31268] Signal code: Address not mapped (1)
> [binf102:31268] Failing at address: 0x10
> [binf102:31268] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2b42213f57c0]
> [binf102:31268] [ 1] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2b42203fd7e3]
> [binf102:31268] [ 2] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x8b) [0x2b4220400d3b]
> [binf102:31268] [ 3] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x6f) [0x2b42204008ef]
> [binf102:31268] [ 4] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(+0x117876) [0x2b4220400876]
> [binf102:31268] [ 5] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0xc34c) [0x2b422572334c]
> [binf102:31268] [ 6] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_class_initialize+0xaa) [0x2b422041d64a]
> [binf102:31268] [ 7] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0x1f12f) [0x2b422573612f]
> [binf102:31268] [ 8] /lib64/libpthread.so.0(+0x77b6) [0x2b42213ed7b6]
> [binf102:31268] [ 9] /lib64/libc.so.6(clone+0x6d) [0x2b42216dcd6d]
> [binf102:31268] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 31268 on node xxxx102 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> <ibv_devinfo.txt><ifconfig.txt>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
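Two other quick data points would help narrow this down, since every frame in that trace is inside either the ptmalloc2 memory hooks or the openib BTL:

    # 1) take openib out of the picture; if this never segfaults,
    #    the problem is isolated to the openib BTL path
    mpirun -np 2 --mca btl tcp,self ring_c

    # 2) if I recall correctly, the 1.6 series lets you disable the
    #    ptmalloc2 hooks entirely; note that this has to be set in the
    #    environment before launch, not passed via --mca on the mpirun
    #    command line
    export OMPI_MCA_memory_linux_disable=1
    mpirun -np 2 --mca btl openib,self ring_c

If the segfault disappears with the hooks disabled, that points at the memory manager rather than the openib BTL itself. The usual caveat applies: disabling the hooks also disables the leave-pinned RDMA optimizations, so treat it as a diagnostic, not a fix.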