Re: [HACKERS] recent deadlock regression test failures

Kevin Grittner Sat, 08 Apr 2017 18:15:45 -0700

On Sat, Apr 8, 2017 at 12:56 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.mu...@enterprisedb.com> writes:
>> On Sat, Apr 8, 2017 at 4:22 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:


>> Based on the above, here is a version that introduces a simple boolean
>> function pg_waiting_for_safe_snapshot(pid) and adds that the the
>> query.  This was my "option 1" upthread.
>
> Nah, this is not good either.  Yes, it's a fairly precise conversion
> of what Kevin's patch did, but I think that patch is wrong even
> without considering the performance angle.  If you look back at the
> discussions in Feb 2016 that led to what we had, it turned out to be
> important not just to be able to say that process X is blocked, but
> that it is blocked by one of the other isolationtest sessions, and
> not by some random third party (like autovacuum).  I do not know
> whether there is, right now, any way for autovacuum to be the guilty
> party for a SafeSnapshot wait --- but it does not seem like a good
> plan to assume that there never will be.

It would make no sense to run autovacuum at the serializable
transaction isolation level, and only overlapping read-write
serializable transactions can block the attempt to acquire a safe
snapshot.

> So I think what we need is functionality very much like what you had
> in the prior patch, ie identify the specific PIDs that are causing
> process X to wait for a safe snapshot.  I'm just not happy with how
> you packaged it.
>
> Here's a sketch of what I think we ought to do:
>
> 1. Leave pg_blocking_pids() alone; it's a published API now.

Fair enough.

> 2. Add GetSafeSnapshotBlockingPids() more or less as you had it
> in the previous patch (+ Kevin's recommendations).  There might be
> value in providing a SQL-level way to call that, but I'm not sure,
> and it would be independent of fixing isolationtester anyway.

It seems entirely plausible that someone might want to know what is
holding up the start of a backup or large report which uses the READ
ONLY DEFERRABLE option, so I think there is a real use case for a
documented SQL function similar to pg_blocking_pids() to show the
pids of connections currently running transactions which are holding
things up.  Of course, they may not initially know whether it is
being blocked by heavyweight locks or concurrent serializable
read-write transactions, but it should not be a big deal to run two
separate functions.

Given the inability to run isolation tests to cover the deferrable
code, we used a variation on DBT-2 that could cause serialization
anomalies to generate a high concurrency saturation run using
serializable transactions, and started a SERIALIZABLE READ ONLY
DEFERRABLE transaction 1200 times competing with this load, timing
how long it took to start.  To quote the VLDB paper[1], "The median
latency was 1.98 seconds, with 90% of transactions able to obtain a
safe snapshot within 6 seconds, and all within 20 seconds. Given the
intended use (long-running transactions), we believe this delay is
reasonable."  That said, a single long-running serializable
read-write transaction could hold up the attempt indefinitely --
there is no maximum bound.  Hence the benefit of a SQL function to
find out what's happening.

> 3. Invent a SQL function that is dedicated specifically to supporting
> isolationtester and need not be documented at all; this gets us out
> of the problem of whether it's okay to whack its semantics around
> anytime somebody thinks of something else they want to test.
> I'm imagining an API like
>
>           isolation_test_is_waiting_for(int, int[]) returns bool
>
> so that isolationtester's query would reduce to something like
>
>         SELECT pg_catalog.isolation_test_is_waiting_for($1, '{...}')
>
> which would take even more cycles out of the parse/plan overhead for it
> (which is basically what's killing the CCA animals).  Internally, this
> function would call pg_blocking_pids and, if necessary,
> GetSafeSnapshotBlockingPids, and would check for matches in its array
> argument directly; it could probably do that significantly faster than the
> general-purpose array && code.  So we'd have to expend a bit of backend C
> code, but we'd have something that would be quite speedy and we could
> customize in future without fear of breaking behavior that users are
> depending on.

Good suggestion.

Thomas, would you like to produce a patch along these lines, or
should I?

--
Kevin Grittner

[1]  Dan R. K. Ports and Kevin Grittner. 2012.
     Serializable Snapshot Isolation in PostgreSQL.
     Proceedings of the VLDB Endowment, Vol. 5, No. 12.
     The 38th International Conference on Very Large Data Bases,
     August 27th - 31st 2012, Istanbul, Turkey.
     http://vldb.org/pvldb/vol5/p1850_danrkports_vldb2012.pdf


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] recent deadlock regression test failures

Reply via email to