Thanks Jeff. I will definitely do the failure analysis, but I just wanted to confirm that this isn't something specific to OMPI itself, e.g., a missing configuration setting.
On Thu, Sep 6, 2012 at 5:01 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> If you run into a segv in this code, it almost certainly means that you have
> heap corruption somewhere. FWIW, that has *always* been what it meant when
> I've run into segv's in any code under opal/mca/memory/linux/. Meaning:
> my user code did something wrong, it created heap corruption, and then later
> some malloc() or free() caused a segv in this area of the code.
>
> This code is the same ptmalloc memory allocator that has shipped in glibc for
> years. I'd be hard-pressed to say that any code is 100% bug free :-), but
> I'd be surprised if there is a bug in this particular chunk of code.
>
> I'd run your code through valgrind or some other memory-checking debugger and
> see if that can shed any light on what's going on.
>
>
> On Sep 6, 2012, at 12:06 AM, Yong Qin wrote:
>
>> Hi,
>>
>> While debugging a mysterious crash of a code, I was able to trace it down
>> to a SIGSEGV in OMPI 1.6 and 1.6.1. The offending code is in
>> opal/mca/memory/linux/malloc.c. Please see the following gdb log.
>>
>> (gdb) c
>> Continuing.
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000)
>>     at malloc.c:4385
>> 4385          nextsize = chunksize(nextchunk);
>> (gdb) l
>> 4380        Consolidate other non-mmapped chunks as they arrive.
>> 4381      */
>> 4382
>> 4383      else if (!chunk_is_mmapped(p)) {
>> 4384        nextchunk = chunk_at_offset(p, size);
>> 4385        nextsize = chunksize(nextchunk);
>> 4386        assert(nextsize > 0);
>> 4387
>> 4388        /* consolidate backward */
>> 4389        if (!prev_inuse(p)) {
>> (gdb) bt
>> #0  opal_memory_ptmalloc2_int_free (av=0x2fd0637,
>>     mem=0x203a746f74512000) at malloc.c:4385
>> #1  0x00002ae6b18ea0c0 in opal_memory_ptmalloc2_free (mem=0x2fd0637)
>>     at malloc.c:3511
>> #2  0x00002ae6b18ea736 in opal_memory_linux_free_hook
>>     (__ptr=0x2fd0637, caller=0x203a746f74512000) at hooks.c:705
>> #3  0x0000000001412fcc in for_dealloc_allocatable ()
>> #4  0x00000000007767b1 in ALLOC::dealloc_d2 (array=@0x2fd0647,
>>     name=@0x6f6e6f69006f6e78, routine=Cannot access memory at address 0x0
>>     ) at alloc.F90:1357
>> #5  0x000000000082628c in M_LDAU::hubbard_term (scell=..., nua=@0xd5,
>>     na=@0xd5, isa=..., xa=..., indxua=..., maxnh=@0xcf4ff, maxnd=@0xcf4ff,
>>     lasto=..., iphorb=...,
>>     numd=..., listdptr=..., listd=..., numh=..., listhptr=...,
>>     listh=..., nspin=@0xcf4ff00000002, dscf=..., eldau=@0x0, deldau=@0x0,
>>     fa=..., stress=..., h=...,
>>     first=@0x0, last=@0x0) at ldau.F:752
>> #6  0x00000000006cd532 in M_SETUP_HAMILTONIAN::setup_hamiltonian
>>     (first=@0x0, last=@0x0, iscf=@0x2) at setup_hamiltonian.F:199
>> #7  0x000000000070e257 in M_SIESTA_FORCES::siesta_forces
>>     (istep=@0xf9a4d07000000000) at siesta_forces.F:90
>> #8  0x000000000070e475 in siesta () at siesta.F:23
>> #9  0x000000000045e47c in main ()
>>
>> Can anybody shed some light here on what could be wrong?
>>
>> Thanks,
>>
>> Yong Qin
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users