On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > > > > So in brief, this solution is only for bidrectional setup? For > > > non-bidirectional, > > > feedback_slots is non-configurable and thus irrelevant. > > > > Right. > > > > One possible idea to address the non-bidirectional case raised by > Shveta is to use a time-based cut-off to remove dead tuples. As > mentioned earlier in my email [1], we can define a new GUC parameter > say vacuum_committs_age which would indicate that we will allow rows > to be removed only if the modified time of the tuple as indicated by > committs module is greater than the vacuum_committs_age. We could keep > this parameter a table-level option without introducing a GUC as this > may not apply to all tables. I checked and found that some other > replication solutions like GoldenGate also allowed similar parameters > (tombstone_deletes) to be specified at table level [2]. The other > advantage of allowing it at table level is that it won't hamper the > performance of hot-pruning or vacuum in general. Note, I am careful > here because to decide whether to remove a dead tuple or not we need > to compare its committs_time both during hot-pruning and vacuum.
+1 on the idea, but IIUC this value doesn’t need to be significant; it can be limited to just a few minutes. The one which is sufficient to handle replication delays caused by network lag or other factors, assuming clock skew has already been addressed. This new parameter is necessary only for cases where an UPDATE and DELETE on the same row occur concurrently, but the replication order to a third node is not preserved, which could result in data divergence. Consider the following example: Node A: T1: INSERT INTO t (id, value) VALUES (1,1); (10.01 AM) T2: DELETE FROM t WHERE id = 1; (10.03 AM) Node B: T3: UPDATE t SET value = 2 WHERE id = 1; (10.02 AM) Assume a third node (Node C) subscribes to both Node A and Node B. The "correct" order of messages received by Node C would be T1-T3-T2, but it could also receive them in the order T1-T2-T3, wherein sayT3 is received with a lag of say 2 mins. In such a scenario, T3 should be able to recognize that the row was deleted by T2 on Node C, thereby detecting the update-deleted conflict and skipping the apply. The 'vacuum_committs_age' parameter should account for this lag, which could lead to the order reversal of UPDATE and DELETE operations. Any subsequent attempt to update the same row after conflict detection and resolution should not pose an issue. For example, if Node A triggers the following at 10:20 AM: UPDATE t SET value = 3 WHERE id = 1; Since the row has already been deleted, the UPDATE will not proceed and therefore will not generate a replication operation on the other nodes, indicating that vacuum need not to preserve the dead row to this far. thanks Shveta