Apologies - I forgot to add some of the information requested by the FAQ:
1. OpenFabrics is provided by the Linux distribution:

[binf102:fischega] $ rpm -qa | grep ofed
ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
ofed-1.5.4.1-0.11.5
ofed-doc-1.5.4.1-0.11.5

2. Linux distro / kernel:

[binf102:fischega] $ cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3

[binf102:fischega] $ uname -a
Linux casl102 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) x86_64 x86_64 x86_64 GNU/Linux

3. Not sure which subnet manager is being used - I think OpenSM, but I'll need to check with my administrators.

4. Output of ibv_devinfo is attached.

5. ifconfig output is attached.

6. ulimit -l output:

[binf102:fischega] $ ulimit -l
unlimited

Greg

From: Fischer, Greg A.
Sent: Tuesday, June 03, 2014 12:38 PM
To: Open MPI Users
Cc: Fischer, Greg A.
Subject: intermittent segfaults with openib on ring_c.c

Hello openmpi-users,

I'm running into a perplexing problem on a new system: intermittent segmentation faults when I run the ring_c.c example with the openib BTL. See an example below. Approximately half the time it produces the expected output; the other half it segfaults. LD_LIBRARY_PATH is set correctly, and the correct version of "mpirun" is being invoked. The output of ompi_info -all is attached.

One potential problem: the system Open MPI was compiled on is mostly, but not entirely, the same as the system where it is being run; some installed packages differ. I've checked the critical ones (libibverbs, librdmacm, libmlx4-rdmav2, etc.), and they appear to be the same.

Can anyone suggest how I might start tracking this problem down?

Thanks,
Greg

[binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
[binf102:31268] *** Process received signal ***
[binf102:31268] Signal: Segmentation fault (11)
[binf102:31268] Signal code: Address not mapped (1)
[binf102:31268] Failing at address: 0x10
[binf102:31268] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2b42213f57c0]
[binf102:31268] [ 1] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2b42203fd7e3]
[binf102:31268] [ 2] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x8b) [0x2b4220400d3b]
[binf102:31268] [ 3] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x6f) [0x2b42204008ef]
[binf102:31268] [ 4] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(+0x117876) [0x2b4220400876]
[binf102:31268] [ 5] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0xc34c) [0x2b422572334c]
[binf102:31268] [ 6] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_class_initialize+0xaa) [0x2b422041d64a]
[binf102:31268] [ 7] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0x1f12f) [0x2b422573612f]
[binf102:31268] [ 8] /lib64/libpthread.so.0(+0x77b6) [0x2b42213ed7b6]
[binf102:31268] [ 9] /lib64/libc.so.6(clone+0x6d) [0x2b42216dcd6d]
[binf102:31268] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 31268 on node xxxx102 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
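For reference, ring_c.c is the token-ring example from Open MPI's examples/ directory. A minimal sketch of the pattern it exercises (from memory; not the exact shipped source) looks like this:

/* Minimal token-ring sketch in the spirit of Open MPI's examples/ring_c.c.
 * Rank 0 injects an integer token; each rank forwards it to (rank+1) % size
 * until it has made a fixed number of laps around the ring. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, next, prev, message;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    next = (rank + 1) % size;          /* downstream neighbor */
    prev = (rank + size - 1) % size;   /* upstream neighbor   */

    if (rank == 0) {
        message = 10;                  /* number of laps the token makes */
        MPI_Send(&message, 1, MPI_INT, next, 201, MPI_COMM_WORLD);
    }

    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, 201, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (rank == 0) {
            --message;                 /* rank 0 decrements once per lap */
            printf("Process 0 decremented value: %d\n", message);
        }
        MPI_Send(&message, 1, MPI_INT, next, 201, MPI_COMM_WORLD);
        if (message == 0)
            break;
    }

    /* rank 0 absorbs the final forwarded token so all sends are matched */
    if (rank == 0)
        MPI_Recv(&message, 1, MPI_INT, prev, 201, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}

The point of the test is that the sends and receives are tiny and deterministic, so a crash here points at transport/allocator setup rather than application logic.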
ibv_devinfo output:

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.8.000
        node_guid:                      0002:c903:0010:371e
        sys_image_guid:                 0002:c903:0010:3721
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       HP_0160000009
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               21
                        port_lmc:               0x00
                        link_layer:             IB

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             IB
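(The same port-state information can also be pulled programmatically through libibverbs, which may help when sanity-checking many nodes at once. A rough sketch, not a polished tool:)

/* Rough sketch: enumerate HCAs and print port states, similar in spirit
 * to what ibv_devinfo reports. Build with: gcc -o port_check port_check.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs;
    int num_devices, i, p;

    devs = ibv_get_device_list(&num_devices);
    if (devs == NULL || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    for (i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        struct ibv_device_attr dev_attr;

        if (ctx == NULL || ibv_query_device(ctx, &dev_attr) != 0)
            continue;

        printf("hca_id: %s  phys_port_cnt: %d\n",
               ibv_get_device_name(devs[i]), dev_attr.phys_port_cnt);

        for (p = 1; p <= dev_attr.phys_port_cnt; p++) {
            struct ibv_port_attr port_attr;

            if (ibv_query_port(ctx, p, &port_attr) != 0)
                continue;
            printf("  port %d: %s, lid %d\n", p,
                   port_attr.state == IBV_PORT_ACTIVE ? "ACTIVE" : "not active",
                   port_attr.lid);
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}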
ifconfig output:

eth0      Link encap:Ethernet  HWaddr 3C:4A:92:F5:2F:B0
          inet addr:10.179.32.21  Bcast:10.179.32.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:166314 errors:0 dropped:0 overruns:0 frame:0
          TX packets:76696 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:108061348 (103.0 Mb)  TX bytes:7935071 (7.5 Mb)
          Memory:fbbc0000-fbbe0000

ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:9.9.10.21  Bcast:9.9.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:109405 errors:0 dropped:0 overruns:0 frame:0
          TX packets:40456 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:69470714 (66.2 Mb)  TX bytes:5093337 (4.8 Mb)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:94650 errors:0 dropped:0 overruns:0 frame:0
          TX packets:94650 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:25057607 (23.8 Mb)  TX bytes:25057607 (23.8 Mb)
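P.S. Since the backtrace dies inside opal_memory_ptmalloc2_int_malloc, one experiment that may isolate the problem (going by my reading of the Open MPI FAQ; the exact parameter name below is from memory, so treat it as unverified) is to rerun with Open MPI's ptmalloc2 memory hooks disabled, and with core dumps enabled so gdb can give a symbolic backtrace:

# confirm the crash is openib-specific by swapping BTLs
[binf102:fischega] $ mpirun -np 2 --mca btl tcp,self ring_c

# rerun with the ptmalloc2 hooks disabled; this has to be an environment
# variable, not an mpirun --mca flag, if I'm reading the FAQ right
# (verify the name with: ompi_info --param memory linux)
[binf102:fischega] $ OMPI_MCA_memory_linux_disable=1 mpirun -np 2 --mca btl openib,self ring_c

# enable core dumps, rerun the failing case, then pull a backtrace
[binf102:fischega] $ ulimit -c unlimited
[binf102:fischega] $ gdb ring_c core -ex bt -ex quit    # core file name is system-dependent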