Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Ralph Castain
Aha!! I found this in our users mailing list archives: http://www.open-mpi.org/community/lists/users/2012/01/18091.php Looks like this is a known compiler vectorization issue. On Jun 4, 2014, at 1:52 PM, Fischer, Greg A. wrote: > Ralph, > > Thanks for looking. Let me know if there's any othe

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
Ralph, Thanks for looking. Let me know if there's any other testing that I can do. I recompiled with GCC and it works fine, so that lends credence to your theory that it has something to do with the Intel compilers, and possibly their interplay with SUSE. Greg

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Ralph Castain
Urggg...unfortunately, the people who know the most about that code are all at the MPI Forum this week, so we may not be able to fully address it until their return. It looks like you are still going down into that malloc interceptor, so I'm not correctly blocking it for you. This run segfa

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
Ralph, It segfaults. Here's the backtrace: Core was generated by `ring_c'. Program terminated with signal 11, Segmentation fault. #0 opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020, bytes=47840385564856) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098 4098 bck->

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Ralph Castain
Sorry for delay - digging my way out of the backlog. This is very strange as you are failing in a simple asprintf call. We check that all the players are non-NULL, and it appears that you are failing to allocate the memory for the resulting (rather short) string. I'm wondering if this is some s
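
For readers following along, here is a minimal sketch of the defensive pattern described above: check the inputs for NULL, then check the return value of asprintf before touching the result. This is not the Open MPI code; the helper name format_endpoint and the "host:port" format are hypothetical and chosen only for illustration.

```c
#define _GNU_SOURCE   /* asprintf is a GNU/BSD extension, not ISO C */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from Open MPI): format "host:port" into a
 * freshly allocated string. Returns the string length on success and
 * -1 on bad input or allocation failure. The caller frees *out.
 */
int format_endpoint(char **out, const char *host, int port)
{
    /* Check that all the players are non-NULL before formatting. */
    if (out == NULL || host == NULL) {
        return -1;
    }

    int n = asprintf(out, "%s:%d", host, port);
    if (n < 0) {
        /* On failure the contents of *out are unspecified; reset it. */
        *out = NULL;
        return -1;
    }

    assert(*out != NULL);   /* invariant after a successful asprintf */
    return n;
}
```

A failure inside the allocator itself, as in the backtraces in this thread, would crash before asprintf returns, so a return-value check like this one cannot catch it. It only separates an ordinary out-of-memory condition from a corrupted heap.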

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Ralph Castain
He isn't getting that far - he's failing in MPI_Init when the RTE attempts to connect to the local daemon On Jun 4, 2014, at 9:53 AM, Gus Correa wrote: > Hi Greg > > From your original email: > > >> [binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c > > This may not fix the prob

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Gus Correa
Hi Greg From your original email: >> [binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c This may not fix the problem, but have you tried to add the shared memory btl to your mca parameter? mpirun -np 2 --mca btl openib,sm,self ring_c As far as I know, sm is the preferred transport

Re: [OMPI users] OPENIB unknown transport errors

2014-06-04 Thread Tim Miller
Hi, I'd like to revive this thread, since I am still periodically getting errors of this type. I have built 1.8.1 with --enable-debug and run with -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide any additional information that I can find useful. I've gone ahead and attached

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Ralph Castain
Thanks!! Really appreciate your help - I'll try to figure out what went wrong and get back to you On Jun 4, 2014, at 8:07 AM, Fischer, Greg A. wrote: > I re-ran with 1 processor and got more information. How about this? > > Core was generated by `ring_c'. > Program terminated with signal 11,

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
I re-ran with 1 processor and got more information. How about this? Core was generated by `ring_c'. Program terminated with signal 11, Segmentation fault. #0 opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020, bytes=47592367980728) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Ralph Castain
Does the trace go any further back? Your prior trace seemed to indicate an error in our OOB framework, but in a very basic place. Looks like it could be an uninitialized variable, and having the line number down as deep as possible might help identify the source On Jun 4, 2014, at 7:55 AM, Fis

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
Oops, ulimit was set improperly. I generated a core file, loaded it in GDB, and ran a backtrace: Core was generated by `ring_c'. Program terminated with signal 11, Segmentation fault. #0 opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020, bytes=47890224382136) at ../../../../../openmpi-1.8.1/o

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
I recompiled with "--enable-debug" but it doesn't seem to be providing any more information or a core dump. I'm compiling ring_c.c with: mpicc ring_c.c -g -traceback -o ring_c and running with: mpirun -np 4 --mca btl openib,self ring_c and I'm getting: [binf112:05845] *** Process received signal