Thanks for persevering with this. I'm far from sure that the information I am providing is of much use, largely because I'm pretty confused about what's going on. Anyway...
Brian Barrett wrote:
> Can you rebuild Open MPI with debugging symbols (just setting CFLAGS
> to -g during configure should do it), rebuild, and get a full call
> stack with line numbers?

For (superfluous) thoroughness, I did configure --enable-debug --enable-memdebug, plus CFLAGS,FFLAGS,FCFLAGS=-g. gdb tells me (abbreviated):

[New Thread 2853808 (LWP 16590)]
[New Thread 18697136 (LWP 16591)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 18697136 (LWP 16591)]
0x00e47a92 in _int_free (av=0xe75580, mem=0x9cb4190) at malloc.c:4371
4371      nextsize = chunksize(nextchunk);
(gdb) bt
#0  0x00e47a92 in _int_free (av=0xe75580, mem=0x9cb4190) at malloc.c:4371
#1  0x00e466fa in free (mem=0x9cb4190) at malloc.c:3501
#2  0x08154590 in for_deallocate. ()
#3  0x08154505 in for_dealloc_allocatable ()
#4  0x0805d71f in spline (x=0x9b37eb0, y=0x9ba5fe8, n=93, yp1=1e+40, ypn=1e+40, y2=0x9c63fe0) at subroutines.f90:167
(gdb) bt full 5
#0  0x00e47a92 in _int_free (av=0xe75580, mem=0x9cb4190) at malloc.c:4371
        p = 0x9cb4188
        size = 134776
        fb = (mfastbinptr *) 0xe464fd
        nextchunk = 0x9cd5000
        nextsize = 744
        nextinuse = 15160704
        prevsize = 14968205
        bck = 0x11d48b4
        fwd = 0x2e8
#1  0x00e466fa in free (mem=0x9cb4190) at malloc.c:3501
        ar_ptr = 0xe75580
        p = 0x9cb4188
        hook = (void (*)(void *, const void *)) 0
#2  0x08154590 in for_deallocate. ()
No symbol table info available.
#3  0x08154505 in for_dealloc_allocatable ()
No symbol table info available.
#4  0x0805d71f in spline (x=0x9b37eb0, y=0x9ba5fe8, n=93, yp1=1e+40, ypn=1e+40, y2=0x9c63fe0) at subroutines.f90:167
        un = 0
        sig = 0.5
        qn = 0
        p = 1.8660254037844382
        k = 0
        i = 93
        u = 0x11d4904

Totalview's memory debugger tells me: "Allocator returned a block already in use: heap may be corrupted" (at an allocation that gives the crash when the associated storage is deallocated).

[valgrind]
> The output might be useful to us, if we could take a look (at least,
> on the OMPI build that fails).
> Again, doing this with a build of
> Open MPI that contains debugging symbols would greatly increase the
> usefulness to us.

I have to suppress many (irrelevant, I think...) warnings, else valgrind stops reporting them before the crash. The final one is:

==10446==
==10446== Invalid read of size 4
==10446==    at 0x1C02FA92: _int_free (malloc.c:4371)
==10446==    by 0x1C02E6F9: free (malloc.c:3501)
==10446==    by 0x815458F: for_deallocate. (in /afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446==    by 0x8154504: for_dealloc_allocatable (in /afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446==  Address 0x8FD3004 is not stack'd, malloc'd or (recently) free'd
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x8fd3004
[0] func:/afs/slac.stanford.edu/g/ki/users/gmorris/tmp/ompi-1.1a1r8803-memdebug-ifort9/lib/libopal.so.0 [0x1c02987a]
[1] func:[0x52bff000]
[2] func:/afs/slac.stanford.edu/g/ki/users/gmorris/tmp/ompi-1.1a1r8803-memdebug-ifort9/lib/libopal.so.0(free+0xa6) [0x1c02e6fa]
[3] func:./cosmomc(for_deallocate.+0x54) [0x8154590]
[4] func:./cosmomc(for_dealloc_allocatable+0x5b) [0x8154505]
[...]
*** End of error message ***
==10446==
==10446== Process terminating with default action of signal 11 (SIGSEGV)
==10446==  Access not within mapped region at address 0x4
==10446==    at 0x1C02FA92: _int_free (malloc.c:4371)
==10446==    by 0x1C02E6F9: free (malloc.c:3501)
==10446==    by 0x815458F: for_deallocate. (in /afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446==    by 0x8154504: for_dealloc_allocatable (in /afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446==