Thanks for persevering with this. I'm far from sure that the
information I am providing is of much use, largely because I'm pretty
confused about what's going on. Anyway...


Brian Barrett wrote:

> Can you rebuild Open MPI with debugging symbols (just setting CFLAGS
> to -g during configure should do it), rebuild, and get a full call
> stack with line numbers?

For (superfluous) thoroughness, I configured with --enable-debug
--enable-memdebug, plus CFLAGS, FFLAGS and FCFLAGS all set to -g.

gdb tells me (abbreviated):

[New Thread 2853808 (LWP 16590)]
[New Thread 18697136 (LWP 16591)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 18697136 (LWP 16591)]
0x00e47a92 in _int_free (av=0xe75580, mem=0x9cb4190) at malloc.c:4371
4371          nextsize = chunksize(nextchunk);
(gdb) bt
#0  0x00e47a92 in _int_free (av=0xe75580, mem=0x9cb4190) at malloc.c:4371
#1  0x00e466fa in free (mem=0x9cb4190) at malloc.c:3501
#2  0x08154590 in for_deallocate. ()
#3  0x08154505 in for_dealloc_allocatable ()
#4  0x0805d71f in spline (x=0x9b37eb0, y=0x9ba5fe8, n=93, yp1=1e+40, 
    ypn=1e+40, y2=0x9c63fe0) at subroutines.f90:167

(gdb) bt full 5
#0  0x00e47a92 in _int_free (av=0xe75580, mem=0x9cb4190) at malloc.c:4371
        p = 0x9cb4188
        size = 134776
        fb = (mfastbinptr *) 0xe464fd
        nextchunk = 0x9cd5000
        nextsize = 744
        nextinuse = 15160704
        prevsize = 14968205
        bck = 0x11d48b4
        fwd = 0x2e8
#1  0x00e466fa in free (mem=0x9cb4190) at malloc.c:3501
        ar_ptr = 0xe75580
        p = 0x9cb4188
        hook = (void (*)(void *, const void *)) 0
#2  0x08154590 in for_deallocate. ()
No symbol table info available.
#3  0x08154505 in for_dealloc_allocatable ()
No symbol table info available.
#4  0x0805d71f in spline (x=0x9b37eb0, y=0x9ba5fe8, n=93, yp1=1e+40, 
    ypn=1e+40, y2=0x9c63fe0) at subroutines.f90:167
        un = 0
        sig = 0.5
        qn = 0
        p = 1.8660254037844382
        k = 0
        i = 93
        u = 0x11d4904
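
As a reading aid, the addresses in frame #0 can be checked against each other. The sketch below is a minimal illustration that assumes a 32-bit ptmalloc-style chunk layout: a 4-byte prev_size field, then a 4-byte size field, then the user memory. The helper names (mem_to_chunk, next_chunk, size_field_addr) are hypothetical and are not taken from malloc.c. Under that assumption, p and nextchunk follow from mem and size as printed by gdb, and the read at malloc.c:4371 would touch the size field of nextchunk.

```python
# Assumption: 32-bit build, so each chunk header field is 4 bytes wide.
SIZE_SZ = 4


def mem_to_chunk(mem: int) -> int:
    """Map a user pointer to its chunk header (skips prev_size and size)."""
    return mem - 2 * SIZE_SZ


def next_chunk(chunk: int, size: int) -> int:
    """Address of the chunk that follows one of the given size."""
    return chunk + size


def size_field_addr(chunk: int) -> int:
    """Address of a chunk's size field, which chunksize() has to read."""
    return chunk + SIZE_SZ


if __name__ == "__main__":
    mem = 0x9CB4190   # 'mem' argument of _int_free in the backtrace
    size = 134776     # 'size' local in frame #0
    p = mem_to_chunk(mem)
    nxt = next_chunk(p, size)
    # prints 0x9cb4188 0x9cd5000 0x9cd5004
    print(hex(p), hex(nxt), hex(size_field_addr(nxt)))
```

The first two values match p and nextchunk in the gdb output, so the pointer arithmetic itself looks consistent. If the layout assumption holds, the fault comes from reading the header at nextchunk rather than from a miscomputed address.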


Totalview's memory debugger tells me: "Allocator returned a block
already in use: heap may be corrupted". It reports this at the
allocation whose storage later triggers the crash when it is
deallocated.
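
To illustrate why heap corruption of this kind tends to surface at deallocation rather than at the offending write, here is a toy model only. It is not how Totalview, glibc or Open MPI implement anything: ToyHeap, GUARD and all method names are hypothetical. Each block is surrounded by guard bytes, writes are deliberately unchecked, and the damage is noticed only when the block is freed.

```python
GUARD = b"\xde\xad\xbe\xef"  # hypothetical guard pattern placed around each block


class ToyHeap:
    """Toy bump allocator over a bytearray, with guard bytes around blocks."""

    def __init__(self, capacity: int = 1024) -> None:
        self.arena = bytearray(capacity)
        self.top = 0
        self.blocks = {}  # user offset -> user size

    def alloc(self, n: int) -> int:
        total = n + 2 * len(GUARD)
        if self.top + total > len(self.arena):
            raise MemoryError("toy arena exhausted")
        user = self.top + len(GUARD)
        self.arena[self.top:user] = GUARD                     # guard before
        self.arena[user + n:user + n + len(GUARD)] = GUARD    # guard after
        self.blocks[user] = n
        self.top += total
        return user

    def write(self, user: int, index: int, value: int) -> None:
        # Deliberately no bounds check against the block size, to mimic an
        # out-of-bounds array store in unchecked compiled code.
        self.arena[user + index] = value

    def free(self, user: int) -> None:
        n = self.blocks.pop(user)
        g = len(GUARD)
        if bytes(self.arena[user - g:user]) != GUARD:
            raise RuntimeError("guard before block overwritten")
        if bytes(self.arena[user + n:user + n + g]) != GUARD:
            raise RuntimeError("guard after block overwritten")


if __name__ == "__main__":
    heap = ToyHeap()
    a = heap.alloc(8)
    heap.write(a, 8, 0)      # one element past the end: silently accepted
    try:
        heap.free(a)         # the damage is only noticed here
    except RuntimeError as err:
        print("detected at free:", err)
```

In this model the faulty store and the detected failure are separated in time, which is one reason a backtrace taken at the crash may point at an innocent deallocation.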


[valgrind]
> The output might be useful to us, if we could take a look (at least,  
> on the OMPI build that fails).  Again, doing this with a build of  
> Open MPI that contains debugging symbols would greatly increase the  
> usefulness to us.

I have to suppress many (irrelevant, I think...) warnings; otherwise
valgrind stops reporting errors before it reaches the crash. The final
report is:

==10446== 
==10446== Invalid read of size 4
==10446==    at 0x1C02FA92: _int_free (malloc.c:4371)
==10446==    by 0x1C02E6F9: free (malloc.c:3501)
==10446==    by 0x815458F: for_deallocate. (in /afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446==    by 0x8154504: for_dealloc_allocatable (in /afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446==  Address 0x8FD3004 is not stack'd, malloc'd or (recently) free'd
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x8fd3004
[0] func:/afs/slac.stanford.edu/g/ki/users/gmorris/tmp/ompi-1.1a1r8803-memdebug-ifort9/lib/libopal.so.0 [0x1c02987a]
[1] func:[0x52bff000]
[2] func:/afs/slac.stanford.edu/g/ki/users/gmorris/tmp/ompi-1.1a1r8803-memdebug-ifort9/lib/libopal.so.0(free+0xa6) [0x1c02e6fa]
[3] func:./cosmomc(for_deallocate.+0x54) [0x8154590]
[4] func:./cosmomc(for_dealloc_allocatable+0x5b) [0x8154505]
[...]
*** End of error message ***
==10446== 
==10446== Process terminating with default action of signal 11 (SIGSEGV)
==10446==  Access not within mapped region at address 0x4
==10446==    at 0x1C02FA92: _int_free (malloc.c:4371)
==10446==    by 0x1C02E6F9: free (malloc.c:3501)
==10446==    by 0x815458F: for_deallocate. (in /afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446==    by 0x8154504: for_dealloc_allocatable (in /afs/slac.stanford.edu/g/ki/users/gmorris/cosmomc/benchmarks/cosmomc/coma-mpi-openmp/O0-ompi-1.1a1r8803-ifort9-memdebug/cosmomc)
==10446== 
