Hi hackers! I want to revive attempts to fix some old edge cases of physical quorum replication.
Please find attached draft patches that demonstrate the ideas. They are not actually proposed code changes; I would rather reach a design consensus first.

1. Allow checking standby sync before making data visible after crash recovery

Problem: the instance must not allow reading data that is not yet known to be replicated. Immediately after a crash we do not know whether we are still the cluster primary, so we can disallow new connections until a standby quorum is established. Of course, walsenders and superusers must be exempt from this restriction.

The key change is the following:

@@ -1214,6 +1215,16 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	if (PostAuthDelay > 0)
 		pg_usleep(PostAuthDelay * 1000000L);

+	/* Check if we need to wait for startup synchronous replication */
+	if (!am_walsender &&
+		!superuser() &&
+		!StartupSyncRepEstablished())
+	{
+		ereport(FATAL,
+				(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+				 errmsg("cannot connect until synchronous replication is established with standbys according to startup_synchronous_standby_level")));
+	}

We might also want some kind of cache recording that the quorum was already established, and the place where the check is done might not be the most appropriate one. A rough sketch of the quorum check itself is included below.

2. Do not allow cancelling a locally written transaction

The problem has been discussed many times [0,1,2,3], with some agreement on the approach taken, but there were concerns that the solution is incomplete without the first patch in this thread.

Problem: a user might try to cancel a locally committed transaction, and if we allow that we show non-replicated data as committed. This leads to losing data with UPSERTs. The key change is in how we process cancels in SyncRepWaitForLSN(); see the sketch below.

3. Allow reading the LSN written by the walreceiver, but not yet flushed

Problem: with synchronous_standby_names = ANY(node1,node2), node2 might be ahead of node1 by flush LSN but behind by written LSN. If we fail over, we choose node2 instead of node1 and lose data recently committed with synchronous_commit = remote_write.

Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns the flushed LSN, not the written one. I propose to add a new function that returns the LSN actually written. The internals of this function are already implemented (GetWalRcvWriteRecPtr()) but unused. Currently we use a separate program, lwaldump [4], which simply reads WAL up to the last valid record, and during a failover pg_consul uses the LSNs reported by lwaldump. This approach works well, but is cumbersome. A sketch of the new function is included below.
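To make the first idea a bit more concrete, here is a rough sketch of the kind of quorum check StartupSyncRepEstablished() needs to perform. The helper name, the explicit target_lsn parameter, and the lack of caching are simplifications for illustration only; the sketch assumes it lives in syncrep.c and can reuse SyncRepGetCandidateStandbys():

#include "postgres.h"

#include "replication/syncrep.h"

/*
 * Illustration only: has the configured quorum of synchronous standbys
 * confirmed flushing WAL up to target_lsn (e.g. the local end of WAL
 * found during crash recovery)?  Which of write/flush/apply to compare
 * against would follow startup_synchronous_standby_level.
 */
static bool
StartupSyncRepQuorumReached(XLogRecPtr target_lsn)
{
	SyncRepStandbyData *standbys;
	int			num_standbys;
	int			confirmed = 0;

	/* Nothing to wait for if synchronous replication is not configured. */
	if (SyncRepConfig == NULL)
		return true;

	num_standbys = SyncRepGetCandidateStandbys(&standbys);

	for (int i = 0; i < num_standbys; i++)
	{
		if (standbys[i].flush >= target_lsn)
			confirmed++;
	}

	pfree(standbys);

	return confirmed >= SyncRepConfig->num_sync;
}

The real check would also need to be cheap enough to run for every new connection, which is where the caching mentioned above comes in.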
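For the second idea, the essence of the change is in the query-cancel branch of the wait loop in SyncRepWaitForLSN(). Roughly (only this branch is shown; the exact wording, and whether the old behaviour should remain reachable, is up for discussion):

		/*
		 * Illustration only: instead of abandoning the wait when a query
		 * cancel arrives (which would let the client see a locally
		 * committed but possibly unreplicated transaction), swallow the
		 * cancel, warn, and keep waiting for the synchronous standbys.
		 */
		if (QueryCancelPending)
		{
			QueryCancelPending = false;
			ereport(WARNING,
					(errmsg("canceling wait for synchronous replication is not allowed"),
					 errdetail("The transaction has already committed locally, but is not yet known to be replicated to the standby.")));
			/* note: no SyncRepCancelWait() and no break here */
		}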
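For the third idea, the new function is essentially a thin SQL-callable wrapper around GetWalRcvWriteRecPtr(), e.g. in xlogfuncs.c. The name pg_last_wal_receive_write_lsn is only a placeholder, and the pg_proc.dat entry and documentation are omitted:

#include "postgres.h"

#include "fmgr.h"
#include "replication/walreceiver.h"
#include "utils/pg_lsn.h"

/*
 * Report the last WAL location written to disk by the walreceiver,
 * whether or not it has been flushed yet.  Counterpart of
 * pg_last_wal_receive_lsn(), which reports the flushed location.
 */
Datum
pg_last_wal_receive_write_lsn(PG_FUNCTION_ARGS)
{
	XLogRecPtr	recptr;

	recptr = GetWalRcvWriteRecPtr();

	if (XLogRecPtrIsInvalid(recptr))
		PG_RETURN_NULL();

	PG_RETURN_LSN(recptr);
}

With something like this, a failover tool such as pg_consul could compare the result of this function across standbys instead of running lwaldump on each of them.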
There are other caveats to replication, but IMO these three problems are the most annoying in terms of data durability.

I'd greatly appreciate any thoughts on this.

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C%40yandex-team.ru
[1] https://www.postgresql.org/message-id/flat/caeet0zhg5off7iecby6tzadh1moslmfz1hlm311p9vot7z+...@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/6a052e81060824a8286148b1165bafedbd7c86cd.ca...@j-davis.com#415dc2f7d41b8a251b419256407bb64d
[3] https://www.postgresql.org/message-id/flat/CALj2ACUrOB59QaE6%3DjF2cFAyv1MR7fzD8tr4YM5%2BOwEYG1SNzA%40mail.gmail.com
[4] https://github.com/g0djan/lwaldump

0001-Allow-checking-standby-sync-before-making-data-visib.patch
0002-Do-not-allow-to-cancel-locally-written-transaction.patch
0003-Allow-reading-LSN-written-by-walreciever-but-not-flu.patch