Does the trace go any further back? Your prior trace seemed to indicate an 
error in our OOB framework, but in a very basic place. It looks like it could 
be an uninitialized variable, and getting a line number from as deep in the 
stack as possible might help identify the source.
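
One way to map the frames in the earlier signal-handler trace to source lines is sketched below. It is a minimal sketch, not part of Open MPI: it assumes the trace uses the "module(symbol+0xoffset)[0xaddress]" frame format shown in the quoted 1.8.1 output, and that the libraries were built with debug symbols (the --enable-debug build). The names FRAME_RE and gdb_list_commands are hypothetical helpers made up for illustration. Each generated command uses gdb's "list *(symbol+offset)" form and is meant to be pasted into a gdb session that has ring_c and the core file loaded.

```python
import re

# Matches one frame of the signal-handler backtrace, for example:
#   /path/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x4b3)[0x2b2fa4ff2b03]
# Assumption: module, symbol+offset, and address all sit on one line.
FRAME_RE = re.compile(
    r"(?P<module>\S+)"
    r"\((?P<symbol>\w*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\s*\[(?P<addr>0x[0-9a-fA-F]+)\]"
)


def gdb_list_commands(trace_text):
    """Return one gdb `list` command per frame that names a symbol."""
    commands = []
    for match in FRAME_RE.finditer(trace_text):
        symbol = match.group("symbol")
        if not symbol:
            # Frames such as "(+0xd1f86)" carry no symbol name; skip them.
            continue
        commands.append("list *(%s+%s)" % (symbol, match.group("offset")))
    return commands


if __name__ == "__main__":
    import sys

    # Usage: python trace_to_gdb.py < saved_trace.txt
    for command in gdb_list_commands(sys.stdin.read()):
        print(command)
```

Frames that show only a module offset or a bare address (for example the ring_c frames) are skipped, since there is no symbol name to hand to gdb.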


On Jun 4, 2014, at 7:55 AM, Fischer, Greg A. <fisch...@westinghouse.com> wrote:

> Oops, ulimit was set improperly. I generated a core file, loaded it in GDB, 
> and ran a backtrace:
>  
> Core was generated by `ring_c'.
> Program terminated with signal 11, Segmentation fault.
> #0  opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020, 
> bytes=47890224382136) at 
> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
> 4098          bck->fd = unsorted_chunks(av);
> (gdb) bt
> #0  opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020, 
> bytes=47890224382136) at 
> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
> #1  0x0000000000000000 in ?? ()
>  
> Is that helpful?
>  
> Greg
>  
> From: Fischer, Greg A. 
> Sent: Wednesday, June 04, 2014 10:17 AM
> To: 'Open MPI Users'
> Cc: Fischer, Greg A.
> Subject: RE: [OMPI users] intermittent segfaults with openib on ring_c.c
>  
> I recompiled with "--enable-debug" but it doesn’t seem to be providing any 
> more information or a core dump. I’m compiling ring_c.c with:
>  
> mpicc ring_c.c -g -traceback -o ring_c
>  
> and running with:
>  
> mpirun -np 4 --mca btl openib,self ring_c
>  
> and I’m getting:
>  
> [binf112:05845] *** Process received signal ***
> [binf112:05845] Signal: Segmentation fault (11)
> [binf112:05845] Signal code: Address not mapped (1)
> [binf112:05845] Failing at address: 0x10
> [binf112:05845] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x2b2fa44d57c0]
> [binf112:05845] [ 1] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x4b3)[0x2b2fa4ff2b03]
> [binf112:05845] [ 2] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58)[0x2b2fa4ff5288]
> [binf112:05845] [ 3] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(+0xd1f86)[0x2b2fa4ff4f86]
> [binf112:05845] [ 4] /lib64/libc.so.6(vasprintf+0x3e)[0x2b2fa4957a7e]
> [binf112:05845] [ 5] /lib64/libc.so.6(asprintf+0x88)[0x2b2fa4937148]
> [binf112:05845] [ 6] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_util_convert_process_name_to_string+0xe2)[0x2b2fa4c873e2]
> [binf112:05845] [ 7] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_oob_base_get_addr+0x25)[0x2b2fa4cbdb15]
> [binf112:05845] [ 8] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_rml_oob.so(orte_rml_oob_get_uri+0xa)[0x2b2fa79c5d2a]
> [binf112:05845] [ 9] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_routed_base_register_sync+0x1fd)[0x2b2fa4cdae7d]
> [binf112:05845] [10] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_routed_binomial.so(+0x3c7b)[0x2b2fa719bc7b]
> [binf112:05845] [11] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_ess_base_app_setup+0x3ad)[0x2b2fa4ca7c8d]
> [binf112:05845] [12] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_ess_env.so(+0x169f)[0x2b2fa6b8f69f]
> [binf112:05845] [13] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x17b)[0x2b2fa4c764bb]
> [binf112:05845] [14] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x438)[0x2b2fa3d1e198]
> [binf112:05845] [15] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0xf7)[0x2b2fa3d44947]
> [binf112:05845] [16] ring_c[0x4024ef]
> [binf112:05845] [17] /lib64/libc.so.6(__libc_start_main+0xe6)[0x2b2fa4906c36]
> [binf112:05845] [18] ring_c[0x4023f9]
> [binf112:05845] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 3 with PID 5845 on node xxxx112 exited on 
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>  
> Does any of that help?
>  
> Greg
>  
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, June 03, 2014 11:54 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c
>  
> Sounds odd - can you configure OMPI --enable-debug and run it again? If it 
> fails and you can get a core dump, could you tell us the line number where it 
> is failing?
>  
>  
> On Jun 3, 2014, at 9:58 AM, Fischer, Greg A. <fisch...@westinghouse.com> 
> wrote:
>  
> 
> Apologies – I forgot to add some of the information requested by the FAQ:
>  
> 1.       OpenFabrics is provided by the Linux distribution:
> 
> [binf102:fischega] $ rpm -qa | grep ofed
> ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
> ofed-1.5.4.1-0.11.5
> ofed-doc-1.5.4.1-0.11.5
> 
> 
> 2.       Linux Distro / Kernel:
> 
> [binf102:fischega] $ cat /etc/SuSE-release
> SUSE Linux Enterprise Server 11 (x86_64)
> VERSION = 11
> PATCHLEVEL = 3
> 
> [binf102:fischega] $ uname -a
> Linux xxxx102 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 
> (ccab990) x86_64 x86_64 x86_64 GNU/Linux
> 
> 
> 3.       Not sure which subnet manager is being used – I think OpenSM, but 
> I’ll need to check with my administrators.
> 
> 
> 4.       Output of ibv_devinfo is attached.
> 
> 
> 5.       Ifconfig output is attached.
> 
> 
> 6.       ulimit -l output:
> 
> [binf102:fischega] $ ulimit -l
> unlimited
>  
> Greg
> 
> 
> From: Fischer, Greg A. 
> Sent: Tuesday, June 03, 2014 12:38 PM
> To: Open MPI Users
> Cc: Fischer, Greg A.
> Subject: intermittent segfaults with openib on ring_c.c
>  
> Hello openmpi-users,
>  
> I’m running into a perplexing problem on a new system, whereby I’m 
> experiencing intermittent segmentation faults when I run the ring_c.c example 
> and use the openib BTL. See an example below. Approximately 50% of the time 
> it provides the expected output, but the other 50% of the time, it segfaults. 
> LD_LIBRARY_PATH is set correctly, and the version of “mpirun” being invoked 
> is correct. The output of ompi_info --all is attached.
>  
> One potential problem may be that the system that OpenMPI was compiled on is 
> mostly the same as the system where it is being executed, but there are some 
> differences in the installed packages. I’ve checked the critical ones 
> (libibverbs, librdmacm, libmlx4-rdmav2, etc.), and they appear to be the same.
>  
> Can anyone suggest how I might start tracking this problem down?
>  
> Thanks,
> Greg
>  
> [binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
> [binf102:31268] *** Process received signal ***
> [binf102:31268] Signal: Segmentation fault (11)
> [binf102:31268] Signal code: Address not mapped (1)
> [binf102:31268] Failing at address: 0x10
> [binf102:31268] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2b42213f57c0]
> [binf102:31268] [ 1] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3)
>  [0x2b42203fd7e3]
> [binf102:31268] [ 2] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x8b)
>  [0x2b4220400d3b]
> [binf102:31268] [ 3] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x6f)
>  [0x2b42204008ef]
> [binf102:31268] [ 4] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(+0x117876)
>  [0x2b4220400876]
> [binf102:31268] [ 5] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0xc34c)
>  [0x2b422572334c]
> [binf102:31268] [ 6] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_class_initialize+0xaa)
>  [0x2b422041d64a]
> [binf102:31268] [ 7] 
> /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0x1f12f)
>  [0x2b422573612f]
> [binf102:31268] [ 8] /lib64/libpthread.so.0(+0x77b6) [0x2b42213ed7b6]
> [binf102:31268] [ 9] /lib64/libc.so.6(clone+0x6d) [0x2b42216dcd6d]
> [binf102:31268] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 31268 on node xxxx102 exited on 
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> <ibv_devinfo.txt><ifconfig.txt>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
