Re: [OMPI users] SIGSEGV in OMPI 1.6.x

Jeff Squyres Thu, 6 Sep 2012 08:01:05 -0400

If you run into a segv in this code, it almost certainly means that you have 
heap corruption somewhere.  FWIW, that has *always* been what it meant when 
I've run into segv's in any code under opal/mca/memory/linux/.  Meaning: my 
user code did something wrong, it created heap corruption, and then later some 
malloc() or free() caused a segv in this area of the code.


This code is the same ptmalloc memory allocator that has shipped in glibc for 
years.  I'll be hard-pressed to say that any code is 100% bug free :-), but I'd 
be surprised if there is a bug in this particular chunk of code.  

I'd run your code through valgrind or some other memory-checking debugger and 
see if that can shed any light on what's going on.


On Sep 6, 2012, at 12:06 AM, Yong Qin wrote:

> Hi,
> 
> While debugging a mysterious crash of a code, I was able to trace down
> to a SIGSEGV in OMPI 1.6 and 1.6.1. The offending code is in
> opal/mca/memory/linux/malloc.c. Please see the following gdb log.
> 
> (gdb) c
> Continuing.
> 
> Program received signal SIGSEGV, Segmentation fault.
> opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000)
> at malloc.c:4385
> 4385          nextsize = chunksize(nextchunk);
> (gdb) l
> 4380           Consolidate other non-mmapped chunks as they arrive.
> 4381        */
> 4382
> 4383        else if (!chunk_is_mmapped(p)) {
> 4384          nextchunk = chunk_at_offset(p, size);
> 4385          nextsize = chunksize(nextchunk);
> 4386          assert(nextsize > 0);
> 4387
> 4388          /* consolidate backward */
> 4389          if (!prev_inuse(p)) {
> (gdb) bt
> #0  opal_memory_ptmalloc2_int_free (av=0x2fd0637,
> mem=0x203a746f74512000) at malloc.c:4385
> #1  0x00002ae6b18ea0c0 in opal_memory_ptmalloc2_free (mem=0x2fd0637)
> at malloc.c:3511
> #2  0x00002ae6b18ea736 in opal_memory_linux_free_hook
> (__ptr=0x2fd0637, caller=0x203a746f74512000) at hooks.c:705
> #3  0x0000000001412fcc in for_dealloc_allocatable ()
> #4  0x00000000007767b1 in ALLOC::dealloc_d2 (array=@0x2fd0647,
> name=@0x6f6e6f69006f6e78, routine=Cannot access memory at address 0x0
> ) at alloc.F90:1357
> #5  0x000000000082628c in M_LDAU::hubbard_term (scell=..., nua=@0xd5,
> na=@0xd5, isa=..., xa=..., indxua=..., maxnh=@0xcf4ff, maxnd=@0xcf4ff,
> lasto=..., iphorb=...,
>    numd=..., listdptr=..., listd=..., numh=..., listhptr=...,
> listh=..., nspin=@0xcf4ff00000002, dscf=..., eldau=@0x0, deldau=@0x0,
> fa=..., stress=..., h=...,
>    first=@0x0, last=@0x0) at ldau.F:752
> #6  0x00000000006cd532 in M_SETUP_HAMILTONIAN::setup_hamiltonian
> (first=@0x0, last=@0x0, iscf=@0x2) at setup_hamiltonian.F:199
> #7  0x000000000070e257 in M_SIESTA_FORCES::siesta_forces
> (istep=@0xf9a4d07000000000) at siesta_forces.F:90
> #8  0x000000000070e475 in siesta () at siesta.F:23
> #9  0x000000000045e47c in main ()
> 
> Can anybody shed some light here on what could be wrong?
> 
> Thanks,
> 
> Yong Qin
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
