Thanks!! Really appreciate your help - I'll try to figure out what went wrong and get back to you.
On Jun 4, 2014, at 8:07 AM, Fischer, Greg A. <fisch...@westinghouse.com> wrote:

> I re-ran with 1 processor and got more information. How about this?
>
> Core was generated by `ring_c'.
> Program terminated with signal 11, Segmentation fault.
> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020, bytes=47592367980728) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
> 4098 bck->fd = unsorted_chunks(av);
> (gdb) bt
> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020, bytes=47592367980728) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
> #1 0x00002b48f2a15e38 in opal_memory_ptmalloc2_malloc (bytes=47592367980576) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:3433
> #2 0x00002b48f2a15b36 in opal_memory_linux_malloc_hook (sz=47592367980576, caller=0x2b48f63000b8) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/hooks.c:691
> #3 0x00002b48f2374b90 in vasprintf () from /lib64/libc.so.6
> #4 0x00002b48f2354148 in asprintf () from /lib64/libc.so.6
> #5 0x00002b48f26dc7d1 in orte_oob_base_get_addr (uri=0x2b48f6300020) at ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:234
> #6 0x00002b48f53e7d4a in orte_rml_oob_get_uri () at ../../../../../openmpi-1.8.1/orte/mca/rml/oob/rml_oob_contact.c:36
> #7 0x00002b48f26fa181 in orte_routed_base_register_sync (setup=32 ' ') at ../../../../openmpi-1.8.1/orte/mca/routed/base/routed_base_fns.c:301
> #8 0x00002b48f4bbcccf in init_routes (job=4130340896, ndat=0x2b48f63000b8) at ../../../../../openmpi-1.8.1/orte/mca/routed/binomial/routed_binomial.c:705
> #9 0x00002b48f26c615d in orte_ess_base_app_setup (db_restrict_local=32 ' ') at ../../../../openmpi-1.8.1/orte/mca/ess/base/ess_base_std_app.c:245
> #10 0x00002b48f45b069f in rte_init () at ../../../../../openmpi-1.8.1/orte/mca/ess/env/ess_env_module.c:146
> #11 0x00002b48f26935ab in orte_init (pargc=0x2b48f6300020, pargv=0x2b48f63000b8, flags=8) at ../../openmpi-1.8.1/orte/runtime/orte_init.c:148
> #12 0x00002b48f1739d38 in ompi_mpi_init (argc=1, argv=0x7fffebf0d1f8, requested=8, provided=0x0) at ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:464
> #13 0x00002b48f1760a37 in PMPI_Init (argc=0x2b48f6300020, argv=0x2b48f63000b8) at pinit.c:84
> #14 0x00000000004024ef in main (argc=1, argv=0x7fffebf0d1f8) at ring_c.c:19
>
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Wednesday, June 04, 2014 11:00 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c
>
> Does the trace go any further back? Your prior trace seemed to indicate an
> error in our OOB framework, but in a very basic place. Looks like it could be
> an uninitialized variable, and having the line number down as deep as
> possible might help identify the source
>
> On Jun 4, 2014, at 7:55 AM, Fischer, Greg A. <fisch...@westinghouse.com> wrote:
>
> Oops, ulimit was set improperly. I generated a core file, loaded it in GDB,
> and ran a backtrace:
>
> Core was generated by `ring_c'.
> Program terminated with signal 11, Segmentation fault.
> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020, bytes=47890224382136) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
> 4098 bck->fd = unsorted_chunks(av);
> (gdb) bt
> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020, bytes=47890224382136) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
> #1 0x0000000000000000 in ?? ()
>
> Is that helpful?
>
> Greg
>
> From: Fischer, Greg A.
> Sent: Wednesday, June 04, 2014 10:17 AM
> To: 'Open MPI Users'
> Cc: Fischer, Greg A.
> Subject: RE: [OMPI users] intermittent segfaults with openib on ring_c.c
>
> I recompiled with “--enable-debug” but it doesn’t seem to be providing any
> more information or a core dump. I’m compiling ring_c.c with:
>
> mpicc ring_c.c -g -traceback -o ring_c
>
> and running with:
>
> mpirun -np 4 --mca btl openib,self ring_c
>
> and I’m getting:
>
> [binf112:05845] *** Process received signal ***
> [binf112:05845] Signal: Segmentation fault (11)
> [binf112:05845] Signal code: Address not mapped (1)
> [binf112:05845] Failing at address: 0x10
> [binf112:05845] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x2b2fa44d57c0]
> [binf112:05845] [ 1] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x4b3)[0x2b2fa4ff2b03]
> [binf112:05845] [ 2] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58)[0x2b2fa4ff5288]
> [binf112:05845] [ 3] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(+0xd1f86)[0x2b2fa4ff4f86]
> [binf112:05845] [ 4] /lib64/libc.so.6(vasprintf+0x3e)[0x2b2fa4957a7e]
> [binf112:05845] [ 5] /lib64/libc.so.6(asprintf+0x88)[0x2b2fa4937148]
> [binf112:05845] [ 6] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_util_convert_process_name_to_string+0xe2)[0x2b2fa4c873e2]
> [binf112:05845] [ 7] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_oob_base_get_addr+0x25)[0x2b2fa4cbdb15]
> [binf112:05845] [ 8] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_rml_oob.so(orte_rml_oob_get_uri+0xa)[0x2b2fa79c5d2a]
> [binf112:05845] [ 9] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_routed_base_register_sync+0x1fd)[0x2b2fa4cdae7d]
> [binf112:05845] [10] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_routed_binomial.so(+0x3c7b)[0x2b2fa719bc7b]
> [binf112:05845] [11] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_ess_base_app_setup+0x3ad)[0x2b2fa4ca7c8d]
> [binf112:05845] [12] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_ess_env.so(+0x169f)[0x2b2fa6b8f69f]
> [binf112:05845] [13] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x17b)[0x2b2fa4c764bb]
> [binf112:05845] [14] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x438)[0x2b2fa3d1e198]
> [binf112:05845] [15] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0xf7)[0x2b2fa3d44947]
> [binf112:05845] [16] ring_c[0x4024ef]
> [binf112:05845] [17] /lib64/libc.so.6(__libc_start_main+0xe6)[0x2b2fa4906c36]
> [binf112:05845] [18] ring_c[0x4023f9]
> [binf112:05845] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 3 with PID 5845 on node xxxx112 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Does any of that help?
>
> Greg
>
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, June 03, 2014 11:54 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c
>
> Sounds odd - can you configure OMPI --enable-debug and run it again? If it
> fails and you can get a core dump, could you tell us the line number where it
> is failing?
>
> On Jun 3, 2014, at 9:58 AM, Fischer, Greg A. <fisch...@westinghouse.com> wrote:
>
> Apologies – I forgot to add some of the information requested by the FAQ:
>
> 1. OpenFabrics is provided by the Linux distribution:
>
> [binf102:fischega] $ rpm -qa | grep ofed
> ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
> ofed-1.5.4.1-0.11.5
> ofed-doc-1.5.4.1-0.11.5
>
> 2. Linux Distro / Kernel:
>
> [binf102:fischega] $ cat /etc/SuSE-release
> SUSE Linux Enterprise Server 11 (x86_64)
> VERSION = 11
> PATCHLEVEL = 3
>
> [binf102:fischega] $ uname -a
> Linux xxxx102 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) x86_64 x86_64 x86_64 GNU/Linux
>
> 3. Not sure which subnet manager is being used – I think OpenSM, but
> I’ll need to check with my administrators.
>
> 4. Output of ibv_devinfo is attached.
>
> 5. ifconfig output is attached.
>
> 6. ulimit -l output:
>
> [binf102:fischega] $ ulimit -l
> unlimited
>
> Greg
>
> From: Fischer, Greg A.
> Sent: Tuesday, June 03, 2014 12:38 PM
> To: Open MPI Users
> Cc: Fischer, Greg A.
> Subject: intermittent segfaults with openib on ring_c.c
>
> Hello openmpi-users,
>
> I’m running into a perplexing problem on a new system, whereby I’m
> experiencing intermittent segmentation faults when I run the ring_c.c example
> and use the openib BTL. See an example below. Approximately 50% of the time
> it provides the expected output, but the other 50% of the time, it segfaults.
> LD_LIBRARY_PATH is set correctly, and the version of “mpirun” being invoked
> is correct. The output of ompi_info --all is attached.
>
> One potential problem may be that the system that OpenMPI was compiled on is
> mostly the same as the system where it is being executed, but there are some
> differences in the installed packages. I’ve checked the critical ones
> (libibverbs, librdmacm, libmlx4-rdmav2, etc.), and they appear to be the same.
>
> Can anyone suggest how I might start tracking this problem down?
>
> Thanks,
> Greg
>
> [binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
> [binf102:31268] *** Process received signal ***
> [binf102:31268] Signal: Segmentation fault (11)
> [binf102:31268] Signal code: Address not mapped (1)
> [binf102:31268] Failing at address: 0x10
> [binf102:31268] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2b42213f57c0]
> [binf102:31268] [ 1] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2b42203fd7e3]
> [binf102:31268] [ 2] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x8b) [0x2b4220400d3b]
> [binf102:31268] [ 3] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x6f) [0x2b42204008ef]
> [binf102:31268] [ 4] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(+0x117876) [0x2b4220400876]
> [binf102:31268] [ 5] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0xc34c) [0x2b422572334c]
> [binf102:31268] [ 6] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_class_initialize+0xaa) [0x2b422041d64a]
> [binf102:31268] [ 7] /xxxx/yyyy_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0x1f12f) [0x2b422573612f]
> [binf102:31268] [ 8] /lib64/libpthread.so.0(+0x77b6) [0x2b42213ed7b6]
> [binf102:31268] [ 9] /lib64/libc.so.6(clone+0x6d) [0x2b42216dcd6d]
> [binf102:31268] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 31268 on node xxxx102 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> <ibv_devinfo.txt><ifconfig.txt>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users