On Thu, Oct 18, 2018 at 2:36 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Larry's REL_10_STABLE failure logs are interesting:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2018-10-17%2020%3A42%3A17
>
> 2018-10-17 15:48:08.849 CDT [55240:7] LOG: dynamic shared memory control segment is corrupt
> 2018-10-17 15:48:08.849 CDT [55240:8] LOG: sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.850 CDT [55240:9] LOG: sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.850 CDT [55240:10] LOG: sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.850 CDT [55240:11] LOG: sem_destroy failed: Invalid argument
> ... lots more ...
> 2018-10-17 15:48:08.862 CDT [55240:122] LOG: sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.862 CDT [55240:123] LOG: sem_destroy failed: Invalid argument
> TRAP: FailedAssertion("!(dsm_control_mapped_size == 0)", File: "dsm.c", Line: 182)
>
> So at least in this case, not only did we lose the DSM segment but also
> all of our semaphores. Is it conceivable that Python somehow destroyed
> those objects, rather than stomping on the contents of the DSM segment?
> If not, how do we explain this log?

One idea: In the backend I'm looking at there is a contiguous run of
read/write mappings from the location of the semaphore array through
to the DSM control segment. That means that a single runaway
loop/memcpy/memset etc. could overwrite both of those. Eventually it
would run off the end of contiguously mapped space and SEGV, and we do
indeed see a segfault from that Python code before the trouble begins.

> Also, why is there branch-specific variation? The fact that v11 and HEAD
> aren't whinging about lost semaphores is not hard to understand --- we
> stopped using SysV semas. But why don't the older branches look like v10
> here?

I think v10 is where we switched to POSIX unnamed (= sem_destroy()),
so it's 10, 11 and master that should be the same in this respect, no?

-- 
Thomas Munro
http://www.enterprisedb.com
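
To make the single-overrun idea above concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the offsets, sizes, and magic values are hypothetical, not PostgreSQL's or any libc's real layout, and it assumes that sem_destroy() and the DSM code each reject an object whose validity marker has been overwritten. A bytearray stands in for the contiguous run of read/write mappings, and a MemoryError stands in for the SEGV.

```python
# Toy model only: every offset, size, and magic value here is hypothetical.
import struct

SEM_MAGIC = 0x53454D31   # hypothetical per-semaphore validity marker
DSM_MAGIC = 0x44534D31   # hypothetical control-segment magic number
SEM_SIZE = 16            # hypothetical bytes per semaphore slot
NUM_SEMS = 8
SEM_BASE = 64            # semaphore array offset within the mapped run
DSM_BASE = 512           # control segment offset within the same run
MAPPED_SIZE = 1024       # end of the contiguous read/write mappings


def build_mapped_run():
    """Return a bytearray standing in for one contiguous writable run."""
    mem = bytearray(MAPPED_SIZE)
    for i in range(NUM_SEMS):
        struct.pack_into("<I", mem, SEM_BASE + i * SEM_SIZE, SEM_MAGIC)
    struct.pack_into("<I", mem, DSM_BASE, DSM_MAGIC)
    return mem


def runaway_fill(mem, start, byte=0):
    """Write forward from start with no bound, like a memset with a bad length.

    Raises MemoryError once the write leaves the mapped run, standing in
    for the SIGSEGV a real process would receive.
    """
    addr = start
    while True:
        if addr >= len(mem):
            raise MemoryError("ran off mapped space at offset %d" % addr)
        mem[addr] = byte
        addr += 1


def count_valid_semaphores(mem):
    """Count semaphore slots whose validity marker is still intact."""
    return sum(
        struct.unpack_from("<I", mem, SEM_BASE + i * SEM_SIZE)[0] == SEM_MAGIC
        for i in range(NUM_SEMS)
    )


def control_segment_ok(mem):
    """Check whether the control segment's magic number is still intact."""
    return struct.unpack_from("<I", mem, DSM_BASE)[0] == DSM_MAGIC


def simulate(start=32):
    """Run one overrun and report (crashed, valid semaphores, control ok)."""
    mem = build_mapped_run()
    crashed = False
    try:
        runaway_fill(mem, start)
    except MemoryError:
        crashed = True
    return crashed, count_valid_semaphores(mem), control_segment_ok(mem)


if __name__ == "__main__":
    crashed, valid_sems, dsm_ok = simulate()
    print("crashed:", crashed)
    print("valid semaphores:", valid_sems, "of", NUM_SEMS)
    print("control segment intact:", dsm_ok)
```

In this model one write that starts below the semaphore array clobbers every semaphore marker and the control segment's magic before it faults, which matches the ordering in the log: a crash first, then a corrupt control segment and a run of sem_destroy failures during cleanup.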