On Tue, Feb 12, 2019 at 1:51 AM Sergei Kornilov <s...@zsrv.org> wrote:
> > Here's confirmed steps to reproduce
>
> Wow, i confirm this testcase is reproducible for me. On my 4-core desktop i
> see "dsa_area could not attach to segment" error after minute or two.

Well that's something -- thanks for this report. I've had 3 different
machines (laptops and servers, with and without optimisation enabled,
clang and gcc, 3 different OSes) grinding away on Justin's test case
for many hours today, without seeing the problem.

> On current REL_11_STABLE branch with PANIC level i see this backtrace for
> failed parallel process:
>
> #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1 0x00007f3b36983535 in __GI_abort () at abort.c:79
> #2 0x000055f03ab87a4e in errfinish (dummy=dummy@entry=0) at elog.c:555
> #3 0x000055f03ab899e0 in elog_finish (elevel=elevel@entry=22,
> fmt=fmt@entry=0x55f03ad86900 "dsa_area could not attach to segment") at
> elog.c:1376
> #4 0x000055f03abaa1e2 in get_segment_by_index
> (area=area@entry=0x55f03cdd6bf0, index=index@entry=7) at dsa.c:1743
> #5 0x000055f03abaa8ab in get_best_segment (area=area@entry=0x55f03cdd6bf0,
> npages=npages@entry=8) at dsa.c:1993
> #6 0x000055f03ababdb8 in dsa_allocate_extended (area=0x55f03cdd6bf0,
> size=size@entry=32768, flags=flags@entry=0) at dsa.c:701

Ok, this contains some clues I didn't have before. Here we see that a
request for a 32KB chunk of memory led to a traversal of the linked list
of segments in a given bin, and at some point we followed a link to
segment index number 7, which turned out to be bogus. We tried to
attach to the segment whose handle is stored in
area->control->segment_handles[7] and it was not known to dsm.c. It
wasn't DSM_HANDLE_INVALID, or you'd have got a different error message.
That means that it wasn't a segment that had been freed by
destroy_superblock(), or it'd hold DSM_HANDLE_INVALID. Hmm.

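To spell out the distinction between the two failure modes, here is a toy
model of the lookup. It is only a sketch, not the real dsa.c/dsm.c code:
every toy_* name, the LOOKUP_* values and the choice of 0 as the invalid
handle are made up for illustration. The point is that a slot holding the
invalid marker (a freed segment) and a slot holding a plausible handle
that nobody knows about are separate outcomes, and the backtrace shows
the second one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins, NOT the real PostgreSQL definitions. */
typedef uint32_t toy_handle;
#define TOY_HANDLE_INVALID ((toy_handle) 0)
#define TOY_MAX_SEGMENTS 8

typedef enum
{
    LOOKUP_OK,       /* handle is set and the segment exists */
    LOOKUP_FREED,    /* slot holds the invalid marker (segment was freed) */
    LOOKUP_UNKNOWN   /* handle looks valid but no such segment exists */
} lookup_result;

/* Toy stand-in for "is this handle known to dsm.c?". */
static int
toy_segment_exists(const toy_handle *known, int nknown, toy_handle h)
{
    for (int i = 0; i < nknown; i++)
    {
        if (known[i] == h)
            return 1;
    }
    return 0;
}

/* Classify what happens when we follow a bin link to segment 'index'. */
static lookup_result
toy_lookup(const toy_handle *segment_handles, int index,
           const toy_handle *known, int nknown)
{
    toy_handle h;

    assert(index >= 0 && index < TOY_MAX_SEGMENTS);
    h = segment_handles[index];

    if (h == TOY_HANDLE_INVALID)
        return LOOKUP_FREED;
    if (!toy_segment_exists(known, nknown, h))
        return LOOKUP_UNKNOWN;
    return LOOKUP_OK;
}
```

In this model, if the handle array is {101, 102, invalid, ..., 999} and
only 101 and 102 exist, then index 2 is the "freed" case and index 7 is
the "unknown" case, which corresponds to what the reported error implies.
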
So perhaps the bin list was corrupted (the segment index was bad due to
some bogus list manipulation logic or memory overrun or...), or we
corrupted our array of handles, or there is some missing locking
somewhere (all bin manipulation and traversal should be protected by
the area lock), or a valid DSM handle was unexpectedly missing (dsm.c
bug, bogus shm_open() EEXIST from the OS).

Can we please see the stderr output of dsa_dump(area), added just
before the PANIC? Can we see the value of "handle" when the error is
raised, and the directory listing for /dev/shm (assuming Linux) after
the crash (maybe you need restart_after_crash = off to prevent
automatic cleanup)?

-- 
Thomas Munro
http://www.enterprisedb.com