Hi, dropping in my 2 cents here.

On Mon, 12 May 2025 at 18:42, Andrey Borodin <x4...@yandex-team.ru> wrote:
> >> Problem: user might try to cancel locally committed transaction and if we
> >> do so we will show non-replicated data as committed. This leads to losing
> >> data with UPSERTs.
> >
> > Could you explain why specifically UPSERTs would lose data (vs any
> > other user workload) in cancellations during SyncRepWaitForLSN?
>
> Upserts change data conditionally. That's where the observed effect affects
> the written data. But the root problem is observing non-replicated data; it
> only becomes obvious when issuing "INSERT ON CONFLICT DO NOTHING" and
> retrying it.
> 1. INSERT ON CONFLICT DO NOTHING hangs on waiting for replication
> 2. JDBC cancels the query after the default timeout
> 3. INSERT ON CONFLICT DO NOTHING succeeds, because there's no WAL written

Right. I think upsert is a red herring here. Any system trying to implement
idempotency / exactly-once delivery will be built around a similar pattern:
check whether the transaction has already been executed; if not, run the
transaction, commit, and retry on failure. This is particularly vulnerable to
the visibility issue because the retry is likely to land on the
partitioned-off leader.
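To make the hazard concrete, here is roughly what such a consumer looks like
in JDBC terms. This is only my own sketch of the pattern; the table, timeout
value and message id are invented for illustration, not taken from any real
system:

// Minimal sketch of the idempotent-consumer pattern; invented schema.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class IdempotentConsumer
{
    /* Record a message exactly once; returns true if this call did the work. */
    static boolean processOnce(Connection conn, long msgId) throws SQLException
    {
        while (true)
        {
            try (PreparedStatement st = conn.prepareStatement(
                     "INSERT INTO processed_messages (msg_id) VALUES (?) " +
                     "ON CONFLICT DO NOTHING"))
            {
                st.setQueryTimeout(5);  /* driver cancels if the commit wait hangs */
                st.setLong(1, msgId);
                /* Autocommit: the commit, and hence the sync rep wait, happens
                 * inside executeUpdate(). */
                int inserted = st.executeUpdate();
                /*
                 * 0 rows means "already processed", so the message is skipped.
                 * If a previous attempt was cancelled while the backend sat in
                 * SyncRepWaitForLSN, that commit is visible locally but may
                 * never have reached the standby, so after failover the
                 * message is silently lost.
                 */
                return inserted > 0;
            }
            catch (SQLException e)
            {
                /*
                 * The timeout/cancel surfaces here as an exception; the
                 * natural reaction is to retry, and the retry lands on the
                 * same (possibly partitioned) primary.
                 */
            }
        }
    }
}

The dangerous part is that the retry cannot distinguish "someone already
processed this message durably" from "my own cancelled commit is sitting
unreplicated on this primary".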
> >
> >> The key change is how we process cancels in SyncRepWaitForLSN().
> >
> > I personally think we should rather move to CSN-based snapshots on
> > both primary and replica (with LSN as CSN), and make visibility of
> > other transactions depend on how much persistence your session wants
> > (SQL spec permitting, of course).
>
> CSN is a snapshot technique and does not affect sync rep in the durability
> aspect. You still WAL-log xid commit.

CSN-based snapshots enable delaying visibility without blocking on cancelling
a commit, and relatedly allow async commits to remain invisible. Snapshots
based on lists of concurrent xids need shared memory to keep track of which
transactions are running and are therefore limited in size; running a
transaction, committing and then cancelling allows for a potentially
unlimited number of concurrent transactions.

> > I.e., if you have synchronous_commit=remote_apply, you wait with
> > sending the commit success message until you have confirmation that
> > your commit LSN has been applied on the configured amount of replicas,
> > and snapshots are taken based on the latest LSN that is known to be
> > applied everywhere, but if you have synchronous_commit=off, you could
> > read the commits (even those committed in s_c=remote_apply sessions)
> > immediately after they've been included in the logs (potentially with
> > some added slack to account for system state updates).
>
> Introducing dependency of snapshot on synchronous_commit level is an
> interesting idea, but it still depends on a cancel not being able to make
> the effect of a transaction visible. It does not contradict the ideas I
> propose here, but supports them.
>
> CSN has been discussed for a couple of decades already; does anything make
> you believe it will arrive soon and that we do not need to fix existing
> problems?

A few things give me hope. One is Heikki's approach of adding an xid
visibility cache to the snapshot [1], which proved to be surprisingly
effective. Another is having a resolution in sight for how to handle async
transaction visibility; my recollection is that this was the major issue that
derailed the feature the last time it was attempted. Allowing different users
to see different states based on their durability requirements looks like a
satisfactory answer to this problem. Whether it's done by overloading
synchronous_commit, a new GUC, or a transaction isolation parameter is a
small matter of bikeshedding. There is even an interesting paper on how this
type of approach can be used to reduce lock durations in contended workloads
by moving the wait to readers [2]. And third, the recent Jepsen report seems
to have renewed wider interest in this problem [3].
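To make the LSN-as-CSN idea above a bit more concrete, here is a toy model.
It is my own illustration with invented names, not PostgreSQL code: a
snapshot reduces to a single LSN horizon, and which horizon a session gets
depends on the durability it asked for, so a cancelled wait for replication
never makes anything visible to sessions that demanded replicated durability.

// Toy model of LSN-as-CSN visibility; invented names, not PostgreSQL code.
final class LsnSnapshot
{
    final long horizonLsn;  /* commits at or below this LSN are visible */

    LsnSnapshot(long horizonLsn) { this.horizonLsn = horizonLsn; }

    /* commitLsn == 0 means the xid has not committed yet. */
    boolean isVisible(long commitLsn)
    {
        return commitLsn != 0 && commitLsn <= horizonLsn;
    }
}

final class SnapshotSource
{
    volatile long locallyLoggedLsn; /* highest commit LSN in the local WAL */
    volatile long replicatedLsn;    /* highest commit LSN acked by the quorum */

    /*
     * A session demanding replicated durability only sees commits the quorum
     * has acknowledged, so a commit whose SyncRepWaitForLSN wait was
     * cancelled stays invisible to it; a relaxed (async) session may read
     * anything already in the local WAL.  No list of in-progress xids is
     * carried in the snapshot, only an LSN horizon.
     */
    LsnSnapshot take(boolean requireReplicated)
    {
        return new LsnSnapshot(requireReplicated ? replicatedLsn
                                                 : locallyLoggedLsn);
    }
}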
On a related topic, I'm also looking at tail latencies with synchronous
replication. Some storage devices have occasional hiccups, and because WAL
necessarily causes head-of-line blocking, those hiccups can be magnified by
multiple orders of magnitude. Right now we don't even try to replicate before
WAL is flushed locally. Ideally I would like quorum commit to be able to
return while the transaction is not yet persistent on the local disk. This
will require some additional awareness on the PostgreSQL side that it is
running as part of a cluster, similar to the "no visibility before replicated
durability" problem we are discussing here.

Regards,
Ants Aasma

[1] https://www.postgresql.org/message-id/flat/80f254d3-8ee9-4cde-a7e3-ee99998154da%40iki.fi#8a550e2adaa6810e25c497f24a2a83fd
[2] https://cs.uwaterloo.ca/~kdaudjee/ED.pdf
[3] https://jepsen.io/analyses/amazon-rds-for-postgresql-17.4