Thanks Jeff. I will definitely do the failure analysis, but I just wanted to confirm that this isn't something specific to OMPI itself, e.g., a missing configuration setting.
On Thu, Sep 6, 2012 at 5:01 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> If you run into a segv in this code, it almost certainly means that you have
> heap corruption somewhere. FWIW, that has *always* been what it meant when
> I've run into segv's in any code under opal/mca/memory/linux/. Meaning:
> my user code did something wrong, it created heap corruption, and then later
> some malloc() or free() caused a segv in this area of the code.
>
> This code is the same ptmalloc memory allocator that has shipped in glibc for
> years. I'd be hard-pressed to say that any code is 100% bug free :-), but
> I'd be surprised if there is a bug in this particular chunk of code.
>
> I'd run your code through valgrind or some other memory-checking debugger and
> see if that can shed any light on what's going on.
>
>
> On Sep 6, 2012, at 12:06 AM, Yong Qin wrote:
>
>> Hi,
>>
>> While debugging a mysterious crash of a code, I was able to trace it down
>> to a SIGSEGV in OMPI 1.6 and 1.6.1. The offending code is in
>> opal/mca/memory/linux/malloc.c. Please see the following gdb log.
>>
>> (gdb) c
>> Continuing.
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000)
>>     at malloc.c:4385
>> 4385          nextsize = chunksize(nextchunk);
>> (gdb) l
>> 4380        Consolidate other non-mmapped chunks as they arrive.
>> 4381      */
>> 4382
>> 4383      else if (!chunk_is_mmapped(p)) {
>> 4384        nextchunk = chunk_at_offset(p, size);
>> 4385        nextsize = chunksize(nextchunk);
>> 4386        assert(nextsize > 0);
>> 4387
>> 4388        /* consolidate backward */
>> 4389        if (!prev_inuse(p)) {
>> (gdb) bt
>> #0  opal_memory_ptmalloc2_int_free (av=0x2fd0637,
>>     mem=0x203a746f74512000) at malloc.c:4385
>> #1  0x00002ae6b18ea0c0 in opal_memory_ptmalloc2_free (mem=0x2fd0637)
>>     at malloc.c:3511
>> #2  0x00002ae6b18ea736 in opal_memory_linux_free_hook
>>     (__ptr=0x2fd0637, caller=0x203a746f74512000) at hooks.c:705
>> #3  0x0000000001412fcc in for_dealloc_allocatable ()
>> #4  0x00000000007767b1 in ALLOC::dealloc_d2 (array=@0x2fd0647,
>>     name=@0x6f6e6f69006f6e78, routine=Cannot access memory at address 0x0
>>     ) at alloc.F90:1357
>> #5  0x000000000082628c in M_LDAU::hubbard_term (scell=..., nua=@0xd5,
>>     na=@0xd5, isa=..., xa=..., indxua=..., maxnh=@0xcf4ff, maxnd=@0xcf4ff,
>>     lasto=..., iphorb=...,
>>     numd=..., listdptr=..., listd=..., numh=..., listhptr=...,
>>     listh=..., nspin=@0xcf4ff00000002, dscf=..., eldau=@0x0, deldau=@0x0,
>>     fa=..., stress=..., h=...,
>>     first=@0x0, last=@0x0) at ldau.F:752
>> #6  0x00000000006cd532 in M_SETUP_HAMILTONIAN::setup_hamiltonian
>>     (first=@0x0, last=@0x0, iscf=@0x2) at setup_hamiltonian.F:199
>> #7  0x000000000070e257 in M_SIESTA_FORCES::siesta_forces
>>     (istep=@0xf9a4d07000000000) at siesta_forces.F:90
>> #8  0x000000000070e475 in siesta () at siesta.F:23
>> #9  0x000000000045e47c in main ()
>>
>> Can anybody shed some light here on what could be wrong?
>>
>> Thanks,
>>
>> Yong Qin
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users