> On 6 May 2025, at 12:00, Matthias van de Meent <boekewurm+postg...@gmail.com> 
> wrote:
> 
> On Fri, 2 May 2025 at 15:00, Andrey Borodin <x4...@yandex-team.ru> wrote:
>> 
>> Hi hackers!
>> 
>> I want to revive attempts to fix some old edge cases of physical quorum 
>> replication.
>> 
>> Please find attached draft patches that demonstrate the ideas. These patches 
>> are not actually proposed code changes; rather, I want to reach a design 
>> consensus first.
> [...]
>> 2. Do not allow to cancel locally written transaction
>> 
>> The problem was discussed many times [0,1,2,3] with some agreement on the 
>> taken approach. But there were concerns that the solution is incomplete 
>> without the first patch in the current thread.
> 
> I'm trying to figure out where in the thread you find this "some
> agreement". Could you reference the posts you're referring to?

https://www.postgresql.org/message-id/flat/CAAhFRxgYTEqfsqj5MJ0XXGjTUvVfLFxQmD317Bw5vYBHe_sBvQ%40mail.gmail.com#0a984dbbd786f547d366c2ca09ba87d2

> 
>> Problem: a user might try to cancel a locally committed transaction, and if 
>> we do so, we will show non-replicated data as committed. This leads to losing 
>> data with UPSERTs.
> 
> Could you explain why specifically UPSERTs would lose data (vs any
> other user workload) in cancellations during SyncRepWaitForLSN?

Upserts change data conditionally; that's where observing the effect influences 
what gets written. But the root problem is observing non-replicated data. It 
only becomes obvious when issuing "INSERT ON CONFLICT DO NOTHING" and retrying it:
1. INSERT ON CONFLICT DO NOTHING hangs on waiting for replication
2. JDBC cancels the query after the default timeout
3. INSERT ON CONFLICT DO NOTHING succeeds immediately, because no new WAL is written
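The sequence above can be sketched as a toy simulation (plain Python, not
PostgreSQL internals; `ToyPrimary`, `local_log`, and `replicated` are invented
names for illustration only):

```python
# Toy model of the anomaly: the commit record is written to local WAL, the
# wait for quorum replication is canceled, and a retried INSERT ... ON
# CONFLICT DO NOTHING reports success without writing anything new -- yet
# the row was never replicated and is lost on failover.

class ToyPrimary:
    def __init__(self):
        self.local_log = []      # locally written WAL (survives the cancel)
        self.replicated = []     # what the standby quorum has acknowledged

    def insert_on_conflict_do_nothing(self, key):
        if key in self.local_log:
            # Conflict: the row is already "committed" locally, so nothing
            # is written and there is no replication wait at all.
            return "DO NOTHING"
        self.local_log.append(key)       # local commit record is written...
        raise TimeoutError("canceled while waiting in SyncRepWaitForLSN")

primary = ToyPrimary()

# 1. The first attempt hangs waiting for replication; the driver cancels it.
try:
    primary.insert_on_conflict_do_nothing("row-1")
except TimeoutError:
    pass                                 # 2. JDBC-style cancel after timeout

# 3. The retry "succeeds" without writing or replicating anything.
result = primary.insert_on_conflict_do_nothing("row-1")
assert result == "DO NOTHING"

# On failover to the standby, "row-1" is gone despite the reported success:
assert "row-1" not in primary.replicated
```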

> 
>> The key change is how we process cancels in SyncRepWaitForLSN().
> 
> I personally think we should rather move to CSN-based snapshots on
> both primary and replica (with LSN as CSN), and make visibility of
> other transactions depend on how much persistence your session wants
> (SQL spec permitting, of course).

CSN is a snapshot technique and does not affect the durability aspect of sync 
rep. You still WAL-log the xid commit.

> I.e., if you have synchronous_commit=remote_apply, you wait with
> sending the commit success message until you have confirmation that
> your commit LSN has been applied on the configured amount of replicas,
> and snapshots are taken based on the latest LSN that is known to be
> applied everywhere, but if you have synchronous_commit=off, you could
> read the commits (even those committed in s_c=remote_apply sessions)
> immediately after they've been included in the logs (potentially with
> some added slack to account for system state updates).

Making snapshots depend on the synchronous_commit level is an interesting idea, 
but it still relies on a cancel not being able to make the effects of a 
transaction visible. It does not contradict the ideas I propose here; rather, 
it supports them.

CSN has been discussed for a couple of decades already; does anything make you 
believe it will arrive soon and that we do not need to fix existing problems?
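For what it's worth, my reading of the LSN-as-CSN visibility rule being
proposed can be sketched like this (a toy model; the constants and function
names are invented for the sketch, not real GUCs or internals):

```python
# Toy sketch of snapshot horizons that depend on the session's
# synchronous_commit level, with LSN used as CSN.

APPLIED_EVERYWHERE_LSN = 100   # highest LSN applied on the standby quorum
LOCALLY_LOGGED_LSN = 130       # highest LSN written to local WAL

def snapshot_horizon(synchronous_commit):
    if synchronous_commit == "remote_apply":
        # See only commits known to be applied on the configured replicas.
        return APPLIED_EVERYWHERE_LSN
    # s_c=off: see anything already logged as committed locally.
    return LOCALLY_LOGGED_LSN

def visible(commit_lsn, synchronous_commit):
    return commit_lsn <= snapshot_horizon(synchronous_commit)

# A commit at LSN 120 is logged locally but not yet applied on the quorum:
assert visible(120, "off")
assert not visible(120, "remote_apply")
```

Note that even in this model, a canceled replication wait must not move the
`off`-level horizon past data that a retrying client could mistake for durable.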

> Similarly, all snapshots you create in a backend with
> synchronous_commit=remote_apply would use the highest LSN which is
> remotely applied according to the applicable rules, while
> synchronous_commit=off implies "all transactions which have been
> logged as committed".
> Changing synchronous_commit to a value that requires higher
> persistence level would cause the backend to wait for its newest
> snapshot LSN to reach that persistence level; IMO an acceptable
> trade-off for switching s_c at runtime.
> 
> This is based on the assumption that if you don't want your commit to
> be very durable, you probably also don't care as much about the
> durability of the data you can see, and if you want your commits to be
> very durable, you probably want to see only very durable data.
> 
> This would also unify the commit visibility order between primary and
> secondary nodes, and would allow users to have session-level 'wait for
> LSN x to be persistent' with much reduced lock times.

That's an interesting problem, but far less urgent. Reads from standbys were 
never consistent, but durability is promised to users.

> 
> (CC-ed to Ants, given his interest in this topic)

Thanks!


> On 12 May 2025, at 05:40, Amit Kapila <amit.kapil...@gmail.com> wrote:
> 
> One idea to solve this problem could be that whenever we cancel
> sync_rep_wait, we set some system-wide flag that indicates that any
> new transaction must ensure that all the current data is replicated to
> the synchronous standby. Once we ensure that we have waited for
> pending transactions to replicate, we can toggle back that system-wide
> flag. Now, if the system restarts for any reason during such a wait,
> we can use your idea to disallow new connections until the standby
> quorum is established.

This flag would be a point of heavy contention. We would need to store it 
durably, which is costly. And on a busy system we almost always have some 
transactions waiting for SyncRep, so even without a flag we already know 
something was in progress.
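To make the contention point concrete, here is a toy sketch of how I read the
proposal (`catchup_required` and the class are invented names; a real
implementation would live in shared memory and need durable storage):

```python
import threading

# Toy sketch of the proposed system-wide flag: after a canceled
# SyncRepWaitForLSN, new transactions must wait until all pending WAL is
# confirmed by the synchronous standby. Every cancel and every new
# transaction has to touch the same shared state.

class ToyCatchupFlag:
    def __init__(self):
        self._lock = threading.Lock()
        self.catchup_required = False    # would have to be stored durably

    def on_sync_wait_canceled(self):
        with self._lock:                 # every canceled wait takes this lock
            self.catchup_required = True

    def on_quorum_caught_up(self):
        with self._lock:                 # toggled back once standbys catch up
            self.catchup_required = False

    def may_start_transaction(self):
        with self._lock:                 # and every new transaction, too
            return not self.catchup_required

flag = ToyCatchupFlag()
flag.on_sync_wait_canceled()
assert not flag.may_start_transaction()  # new transactions are blocked
flag.on_quorum_caught_up()
assert flag.may_start_transaction()
```

On a busy system the flag would be set nearly all the time, which is why it
adds cost without adding much information.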


I'm attaching a slightly modified patch set. Now the recovery point is 
correctly stored in shared memory.


Best regards, Andrey Borodin.

Attachment: v2-0001-Allow-checking-standby-sync-before-making-data-vi.patch
Description: Binary data

Attachment: v2-0002-Do-not-allow-to-cancel-locally-written-transactio.patch
Description: Binary data
