Hi hackers! I want to revive attempts to fix some old edge cases of physical quorum replication.
Please find attached draft patches that demonstrate the ideas. They are not actually proposed code changes; I would rather reach a design consensus first.

1. Allow checking standby sync before making data visible after crash recovery

Problem: the instance must not allow reading data that is not yet known to be replicated. Immediately after a crash we do not know whether we are still the cluster primary, so we can disallow new connections until a standby quorum is established. Of course, walsenders and superusers must be exempt from this restriction.

The key change is the following:

@@ -1214,6 +1215,16 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	if (PostAuthDelay > 0)
 		pg_usleep(PostAuthDelay * 1000000L);

+	/* Check if we need to wait for startup synchronous replication */
+	if (!am_walsender &&
+		!superuser() &&
+		!StartupSyncRepEstablished())
+	{
+		ereport(FATAL,
+				(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+				 errmsg("cannot connect until synchronous replication is established with standbys according to startup_synchronous_standby_level")));
+	}

We might also want some kind of cache recording that the quorum was already established, and the place where the check is done might not be the most appropriate one. A rough sketch of the quorum check itself is included below.

2. Do not allow cancelling a locally written transaction

The problem has been discussed many times [0,1,2,3], with some agreement on the approach taken, but there were concerns that the solution is incomplete without the first patch in this thread.

Problem: a user might try to cancel a locally committed transaction, and if we allow that we show non-replicated data as committed. This leads to losing data with UPSERTs. The key change is in how we process cancels in SyncRepWaitForLSN(); see the sketch below.

3. Allow reading the LSN written by the walreceiver, but not yet flushed

Problem: with synchronous_standby_names = ANY(node1,node2), node2 might be ahead of node1 by flush LSN but behind by written LSN. If we fail over, we choose node2 instead of node1 and lose data recently committed with synchronous_commit = remote_write.

Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns the flushed LSN, not the written one. I propose to add a new function that returns the LSN actually written. The internals of this function are already implemented (GetWalRcvWriteRecPtr()) but unused. Currently we use a separate program, lwaldump [4], which simply reads WAL up to the last valid record, and during a failover pg_consul uses the LSNs reported by lwaldump. This approach works well, but is cumbersome. A sketch of the new function is included below.
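To make the first idea a bit more concrete, here is a rough sketch of the kind of quorum check StartupSyncRepEstablished() needs to perform. The helper name, the explicit target_lsn parameter, and the lack of caching are simplifications for illustration only; the sketch assumes it lives in syncrep.c and can reuse SyncRepGetCandidateStandbys():

#include "postgres.h"

#include "replication/syncrep.h"

/*
 * Illustration only: has the configured quorum of synchronous standbys
 * confirmed flushing WAL up to target_lsn (e.g. the local end of WAL
 * found during crash recovery)?  Which of write/flush/apply to compare
 * against would follow startup_synchronous_standby_level.
 */
static bool
StartupSyncRepQuorumReached(XLogRecPtr target_lsn)
{
	SyncRepStandbyData *standbys;
	int			num_standbys;
	int			confirmed = 0;

	/* Nothing to wait for if synchronous replication is not configured. */
	if (SyncRepConfig == NULL)
		return true;

	num_standbys = SyncRepGetCandidateStandbys(&standbys);

	for (int i = 0; i < num_standbys; i++)
	{
		if (standbys[i].flush >= target_lsn)
			confirmed++;
	}

	pfree(standbys);

	return confirmed >= SyncRepConfig->num_sync;
}

The real check would also need to be cheap enough to run for every new connection, which is where the caching mentioned above comes in.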
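For the second idea, the essence of the change is in the query-cancel branch of the wait loop in SyncRepWaitForLSN(). Roughly (only this branch is shown; the exact wording, and whether the old behaviour should remain reachable, is up for discussion):

		/*
		 * Illustration only: instead of abandoning the wait when a query
		 * cancel arrives (which would let the client see a locally
		 * committed but possibly unreplicated transaction), swallow the
		 * cancel, warn, and keep waiting for the synchronous standbys.
		 */
		if (QueryCancelPending)
		{
			QueryCancelPending = false;
			ereport(WARNING,
					(errmsg("canceling wait for synchronous replication is not allowed"),
					 errdetail("The transaction has already committed locally, but is not yet known to be replicated to the standby.")));
			/* note: no SyncRepCancelWait() and no break here */
		}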
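For the third idea, the new function is essentially a thin SQL-callable wrapper around GetWalRcvWriteRecPtr(), e.g. in xlogfuncs.c. The name pg_last_wal_receive_write_lsn is only a placeholder, and the pg_proc.dat entry and documentation are omitted:

#include "postgres.h"

#include "fmgr.h"
#include "replication/walreceiver.h"
#include "utils/pg_lsn.h"

/*
 * Report the last WAL location written to disk by the walreceiver,
 * whether or not it has been flushed yet.  Counterpart of
 * pg_last_wal_receive_lsn(), which reports the flushed location.
 */
Datum
pg_last_wal_receive_write_lsn(PG_FUNCTION_ARGS)
{
	XLogRecPtr	recptr;

	recptr = GetWalRcvWriteRecPtr();

	if (XLogRecPtrIsInvalid(recptr))
		PG_RETURN_NULL();

	PG_RETURN_LSN(recptr);
}

With something like this, a failover tool such as pg_consul could compare the result of this function across standbys instead of running lwaldump on each of them.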
There are other caveats to replication, but IMO these three problems are the most annoying in terms of data durability.

I'd greatly appreciate any thoughts on this.

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C%40yandex-team.ru
[1] https://www.postgresql.org/message-id/flat/caeet0zhg5off7iecby6tzadh1moslmfz1hlm311p9vot7z+...@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/6a052e81060824a8286148b1165bafedbd7c86cd.ca...@j-davis.com#415dc2f7d41b8a251b419256407bb64d
[3] https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
[4] https://github.com/g0djan/lwaldump

0001-Allow-checking-standby-sync-before-making-data-visib.patch
0002-Do-not-allow-to-cancel-locally-written-transaction.patch
0003-Allow-reading-LSN-written-by-walreciever-but-not-flu.patch