Hi, hackers!

This is a continuation of thread [0] in pgsql-general, now with proposed
changes. As Maksim pointed out, this topic was raised before in [1].

Currently, we can get split brain through the following sequence of steps:
0. Set up a cluster with synchronous replication. Isolate the primary from the
standbys.
1. Issue an upsert query INSERT .. ON CONFLICT DO NOTHING
2. Cancel query 1 while it waits for synchronous replication
3. Retry query 1. The idempotent query will succeed and the client has
confirmation of written data, while it is not present on any standby.
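
To make the sequence concrete, here is a toy model of the steps above. It is
only a sketch and does not touch PostgreSQL: the names Node, Primary, upsert
and run_scenario are hypothetical, and the sync wait is reduced to a flag. It
assumes, as described above, that the row is already committed locally when
the wait is canceled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A server holding a set of committed keys (hypothetical model)."""
    rows: set = field(default_factory=set)


@dataclass
class Primary(Node):
    standbys: list = field(default_factory=list)
    isolated: bool = False  # True: no standby can acknowledge WAL

    def upsert(self, key, cancel_sync_wait=False):
        """Model of INSERT .. ON CONFLICT DO NOTHING; returns a status string."""
        if key in self.rows:
            # Conflict: nothing is written, so there is nothing to wait for.
            return "ok"
        # The commit is made locally before the synchronous wait starts.
        self.rows.add(key)
        if not self.isolated:
            for standby in self.standbys:
                standby.rows.add(key)
            return "ok"
        if cancel_sync_wait:
            # Step 2: the wait is canceled, the local commit stays.
            return "warning: committed locally, not replicated"
        return "blocked: waiting for synchronous replication"


def run_scenario():
    standby = Node()
    primary = Primary(standbys=[standby], isolated=True)       # step 0
    first = primary.upsert("k", cancel_sync_wait=True)          # steps 1 and 2
    retry = primary.upsert("k")                                  # step 3
    return first, retry, "k" in primary.rows, "k" in standby.rows


if __name__ == "__main__":
    print(run_scenario())
```

In this model the retry reports success although the key exists only on the
primary, which is the inconsistency described in step 3.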

Thread [0] contains a reproduction from psql.

In certain situations we cannot avoid cancellation of timed-out queries. Yes,
we can interpret warnings and treat them as errors, but the warning is emitted
on step 1, not on step 3.

I think the proper solution here would be to add a GUC to disallow cancellation
of the wait for synchronous replication. The retry in step 3 will then wait on
locks behind the hanging query 1, and the data will be consistent.
There is still a problem when the backend is not canceled, but terminated [2].
The ideal solution would be to keep locks on the changed data. Some well-known
databases treat termination of synchronous replication as a system failure and
refuse to operate until standbys appear (see Maximum Protection mode). From my
point of view, it's enough to PANIC once so that the HA tool is informed that
something is going wrong.
Anyway, the situation with cancellation is more dangerous. We've observed it in
some user cases.

Please find attached a draft of the proposed change.

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/B70260F9-D0EC-438D-9A59-31CB996B320A%40yandex-team.ru
[1] https://www.postgresql.org/message-id/flat/CAEET0ZHG5oFF7iEcbY6TZadh1mosLmfz1HLm311P9VOt7Z%2Bjeg%40mail.gmail.com
[2] https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION-HA

Attachment: 0001-Disallow-cancelation-of-syncronous-commit-V1.patch
Description: Binary data
