Hi, dropping in my 2 cents here.

On Mon, 12 May 2025 at 18:42, Andrey Borodin <x4...@yandex-team.ru> wrote:
> >> Problem: user might try to cancel locally committed transaction and if we
> >> do so we will show non-replicated data as committed. This leads to losing
> >> data with UPSERTs.
> >
> > Could you explain why specifically UPSERTs would lose data (vs any
> > other user workload) in cancellations during SyncRepWaitForLSN?
>
> Upserts change data conditionally. That's where the observed effect affects
> the written data. But the root problem is observing non-replicated data; it
> only becomes obvious when issuing "INSERT ON CONFLICT DO NOTHING" and
> retrying it.
> 1. INSERT ON CONFLICT DO NOTHING hangs on waiting for replication
> 2. JDBC cancels the query after the default timeout
> 3. INSERT ON CONFLICT DO NOTHING succeeds, because there's no WAL written

Right. I think upsert is a red herring here. Any system trying to implement
idempotency / exactly-once delivery will be built around a similar pattern:
check whether the transaction has already been executed; if not, run the
transaction, commit, and retry on failure. This is particularly vulnerable to
the visibility issue because the retry is likely to land on the
partitioned-off leader.
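To make the hazard concrete, here is roughly what such a consumer looks like
in JDBC terms. This is only my own sketch of the pattern; the table, timeout
value and message id are invented for illustration, not taken from any real
system:

// Minimal sketch of the idempotent-consumer pattern; invented schema.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class IdempotentConsumer
{
    /* Record a message exactly once; returns true if this call did the work. */
    static boolean processOnce(Connection conn, long msgId) throws SQLException
    {
        while (true)
        {
            try (PreparedStatement st = conn.prepareStatement(
                     "INSERT INTO processed_messages (msg_id) VALUES (?) " +
                     "ON CONFLICT DO NOTHING"))
            {
                st.setQueryTimeout(5);  /* driver cancels if the commit wait hangs */
                st.setLong(1, msgId);
                /* Autocommit: the commit, and hence the sync rep wait, happens
                 * inside executeUpdate(). */
                int inserted = st.executeUpdate();
                /*
                 * 0 rows means "already processed", so the message is skipped.
                 * If a previous attempt was cancelled while the backend sat in
                 * SyncRepWaitForLSN, that commit is visible locally but may
                 * never have reached the standby, so after failover the
                 * message is silently lost.
                 */
                return inserted > 0;
            }
            catch (SQLException e)
            {
                /*
                 * The timeout/cancel surfaces here as an exception; the
                 * natural reaction is to retry, and the retry lands on the
                 * same (possibly partitioned) primary.
                 */
            }
        }
    }
}

The dangerous part is that the retry cannot distinguish "someone already
processed this message durably" from "my own cancelled commit is sitting
unreplicated on this primary".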
> >
> >> The key change is how we process cancels in SyncRepWaitForLSN().
> >
> > I personally think we should rather move to CSN-based snapshots on
> > both primary and replica (with LSN as CSN), and make visibility of
> > other transactions depend on how much persistence your session wants
> > (SQL spec permitting, of course).
>
> CSN is a snapshot technique and does not affect sync rep in the durability
> aspect. You still WAL-log xid commit.

CSN-based snapshots enable delaying visibility without blocking on cancelling
a commit, and relatedly allow async commits to remain invisible. Snapshots
based on lists of concurrent xids need shared memory to keep track of which
transactions are running and are therefore limited in size; running a
transaction, committing and then cancelling allows for a potentially
unlimited number of concurrent transactions.

> > I.e., if you have synchronous_commit=remote_apply, you wait with
> > sending the commit success message until you have confirmation that
> > your commit LSN has been applied on the configured amount of replicas,
> > and snapshots are taken based on the latest LSN that is known to be
> > applied everywhere, but if you have synchronous_commit=off, you could
> > read the commits (even those committed in s_c=remote_apply sessions)
> > immediately after they've been included in the logs (potentially with
> > some added slack to account for system state updates).
>
> Introducing dependency of snapshot on synchronous_commit level is an
> interesting idea, but it still depends on a cancel not being able to make
> the effect of a transaction visible. It does not contradict the ideas I
> propose here, but supports them.
>
> CSN has been discussed for a couple of decades already; does anything make
> you believe it will arrive soon and that we do not need to fix existing
> problems?

A few things give me hope. One is Heikki's approach of adding an xid
visibility cache to the snapshot [1], which proved to be surprisingly
effective. Another is having a resolution in sight for how to handle async
transaction visibility; my recollection is that this was the major issue that
derailed the feature the last time it was attempted. Allowing different users
to see different states based on their durability requirements looks like a
satisfactory answer to this problem. Whether it's done by overloading
synchronous_commit, a new GUC, or a transaction isolation parameter is a
small matter of bikeshedding. There is even an interesting paper on how this
type of approach can be used to reduce lock durations in contended workloads
by moving the wait to readers [2]. And third, the recent Jepsen report seems
to have renewed wider interest in this problem [3].
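To make the LSN-as-CSN idea above a bit more concrete, here is a toy model.
It is my own illustration with invented names, not PostgreSQL code: a
snapshot reduces to a single LSN horizon, and which horizon a session gets
depends on the durability it asked for, so a cancelled wait for replication
never makes anything visible to sessions that demanded replicated durability.

// Toy model of LSN-as-CSN visibility; invented names, not PostgreSQL code.
final class LsnSnapshot
{
    final long horizonLsn;  /* commits at or below this LSN are visible */

    LsnSnapshot(long horizonLsn) { this.horizonLsn = horizonLsn; }

    /* commitLsn == 0 means the xid has not committed yet. */
    boolean isVisible(long commitLsn)
    {
        return commitLsn != 0 && commitLsn <= horizonLsn;
    }
}

final class SnapshotSource
{
    volatile long locallyLoggedLsn; /* highest commit LSN in the local WAL */
    volatile long replicatedLsn;    /* highest commit LSN acked by the quorum */

    /*
     * A session demanding replicated durability only sees commits the quorum
     * has acknowledged, so a commit whose SyncRepWaitForLSN wait was
     * cancelled stays invisible to it; a relaxed (async) session may read
     * anything already in the local WAL.  No list of in-progress xids is
     * carried in the snapshot, only an LSN horizon.
     */
    LsnSnapshot take(boolean requireReplicated)
    {
        return new LsnSnapshot(requireReplicated ? replicatedLsn
                                                 : locallyLoggedLsn);
    }
}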
On a related topic, I'm also looking at tail latencies with synchronous
replication. Some storage devices have occasional hiccups, and because WAL
necessarily causes head-of-line blocking, those hiccups can be magnified by
multiple orders of magnitude. Right now we don't even try to replicate before
WAL is flushed locally. Ideally I would like quorum commit to be able to
return while the transaction is not yet persistent on the local disk. This
will require some additional awareness on the PostgreSQL side that it is
running as part of a cluster, similar to the "no visibility before replicated
durability" problem we are discussing here.

Regards,
Ants Aasma

[1] https://www.postgresql.org/message-id/flat/80f254d3-8ee9-4cde-a7e3-ee99998154da%40iki.fi#8a550e2adaa6810e25c497f24a2a83fd
[2] https://cs.uwaterloo.ca/~kdaudjee/ED.pdf
[3] https://jepsen.io/analyses/amazon-rds-for-postgresql-17.4