Re: Disallow cancellation of waiting for synchronous replication

Robert Haas Sat, 28 Dec 2019 13:57:02 -0800

On Fri, Dec 20, 2019 at 12:04 AM Andrey Borodin <[email protected]> wrote:
> Currently, we can have split brain with the combination of following steps:
> 0. Setup cluster with synchronous replication. Isolate primary from standbys.
> 1. Issue upsert query INSERT .. ON CONFLICT DO NOTHING
> 2. CANCEL 1 during wait for synchronous replication
> 3. Retry 1. Idempotent query will succeed and client have confirmation of 
> written data, while it is not present on any standby.


It seems to me that in order for synchronous replication to work
reliably, you've got to be very careful about any situation where a
commit might or might not have completed, and this is one such
situation. When the client sends the query cancel, it does not and
cannot know whether the INSERT statement has not yet completed, has
already completed but not yet replicated, or has completed and
replicated but not yet sent back a response. However, the server's
response will be different in each of those cases, because in the
second case, there will be a WARNING about synchronous replication
having been interrupted. If the client ignores that WARNING, there are
going to be problems.

Now, even if you do pay attention to the warning, things are not
totally great here, because if you have inadvertently interrupted a
replication wait, how do you recover? You can't send a command that
means "oh, I want to wait after all." You would have to check the
standbys yourself, from the application code, and see whether the
changes that the query made have shown up there, or check the LSN on
the master and wait for the standbys to advance to that LSN. That's
not great, but might be doable for some applications.

One idea that I had during the initial discussion around synchronous
replication was that maybe there ought to be a distinction between
trying to cancel the query and trying to cancel the replication wait.
Imagine that you could send a cancel that would only cancel
replication waits but not queries, or only queries but not replication
waits. Then you could solve this problem by asking the server to
PQcancelWithAdvancedMagic(conn, PQ_CANCEL_TYPE_QUERY). I wasn't sure
that people would want this, and it didn't seem essential for the
version of this feature, but maybe this example shows that it would be
worthwhile. I don't really have any idea how you'd integrate such a
feature with psql, but maybe it would be good enough to have it
available through the C interface. Also, it's a wire-protocol change,
so there are compatibility issues to think about.

All that being said, like Tom and Michael, I don't think teaching the
backend to ignore cancels is the right approach. We have had
innumerable problems over the years that were caused by the backend
failing to respond to cancels when we would really have liked it to do
so, and users REALLY hate it when you tell them that they have to shut
down and restart (with crash recovery) the entire database because of
a single stuck backend.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Disallow cancellation of waiting for synchronous replication

Reply via email to