Re: DSM robustness failure (was Re: Peripatus/failures)

Thomas Munro Wed, 17 Oct 2018 19:59:44 -0700

On Thu, Oct 18, 2018 at 2:36 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Larry's REL_10_STABLE failure logs are interesting:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2018-10-17%2020%3A42%3A17
>
> 2018-10-17 15:48:08.849 CDT [55240:7] LOG:  dynamic shared memory control segment is corrupt
> 2018-10-17 15:48:08.849 CDT [55240:8] LOG:  sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.850 CDT [55240:9] LOG:  sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.850 CDT [55240:10] LOG:  sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.850 CDT [55240:11] LOG:  sem_destroy failed: Invalid argument
> ... lots more ...
> 2018-10-17 15:48:08.862 CDT [55240:122] LOG:  sem_destroy failed: Invalid argument
> 2018-10-17 15:48:08.862 CDT [55240:123] LOG:  sem_destroy failed: Invalid argument
> TRAP: FailedAssertion("!(dsm_control_mapped_size == 0)", File: "dsm.c", Line: 182)
>
> So at least in this case, not only did we lose the DSM segment but also
> all of our semaphores.  Is it conceivable that Python somehow destroyed
> those objects, rather than stomping on the contents of the DSM segment?
> If not, how do we explain this log?


One idea:  In the backend I'm looking at, there is a contiguous run of
read/write mappings from the location of the semaphore array
through to the DSM control segment.  That means that a single runaway
loop/memcpy/memset etc. could overwrite both of those.  Eventually it
would run off the end of the contiguously mapped space and SEGV, and we do
indeed see a segfault from that Python code before the trouble begins.
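
To make that scenario concrete, here is a small stand-alone sketch (not
PostgreSQL code) of how one oversized memset can damage two adjacent
regions of a single read/write mapping.  Everything in it is an
assumption made for illustration: the two-page layout, the names
SEM_REGION and CONTROL_OFFSET, the magic value, and the helper
runaway_memset() are all hypothetical.  The write is clamped to the end
of the mapping, so the sketch stops where a real runaway write would hit
unmapped memory and SEGV.

```python
import ctypes
import mmap
import struct

PAGE = mmap.PAGESIZE
SEM_REGION = PAGE            # hypothetical: "semaphore array" fills the first page
CONTROL_OFFSET = SEM_REGION  # hypothetical: "control segment" header follows directly
CONTROL_MAGIC = 0x12345678   # hypothetical magic number, not PostgreSQL's
SEM_FILL = b"\x01"           # sentinel byte standing in for live semaphore state


def runaway_memset(start, length):
    """Zero `length` bytes from offset `start` in a two-page mapping.

    Returns (semaphores_intact, control_intact) after the write.
    """
    if start < 0:
        raise ValueError("start must be non-negative")

    # One anonymous read/write mapping holding both regions back to back.
    region = mmap.mmap(-1, 2 * PAGE)
    region[0:SEM_REGION] = SEM_FILL * SEM_REGION
    struct.pack_into("<I", region, CONTROL_OFFSET, CONTROL_MAGIC)

    # Clamp so the demo never leaves the mapping; a real process would
    # fault at this boundary instead.
    length = max(0, min(length, len(region) - start))

    # ctypes.memset does no bounds checking against the logical regions,
    # which is the point: nothing separates the two areas.
    view = (ctypes.c_char * len(region)).from_buffer(region)
    ctypes.memset(ctypes.addressof(view) + start, 0, length)
    del view  # release the buffer export so the mapping can be closed

    semaphores_intact = region[0:SEM_REGION] == SEM_FILL * SEM_REGION
    magic = struct.unpack_from("<I", region, CONTROL_OFFSET)[0]
    region.close()
    return semaphores_intact, magic == CONTROL_MAGIC
```

A write that stays inside the first page, such as runaway_memset(0, 16),
damages only the stand-in semaphores, while runaway_memset(0, SEM_REGION + 4)
also wipes the control header's magic number.  That second case is the
shape of failure that would produce both kinds of complaint at once.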

> Also, why is there branch-specific variation?  The fact that v11 and HEAD
> aren't whinging about lost semaphores is not hard to understand --- we
> stopped using SysV semas.  But why don't the older branches look like v10
> here?

I think v10 is where we switched to POSIX unnamed semaphores (hence
sem_destroy()), so it's 10, 11 and master that should be the same in
this respect, no?

-- 
Thomas Munro
http://www.enterprisedb.com
