Hi,

Thank you for considering another idea.
On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> <houzj.f...@fujitsu.com> wrote:
> >
> > Apart from the vacuum_defer_cleanup_age idea.
>
> I think you meant to say the vacuum_committs_age idea.
>
> > we've given more thought to our
> > approach for retaining dead tuples and have come up with another idea
> > that can reliably detect conflicts without requiring users to choose a
> > wise value for vacuum_committs_age. This new idea could also reduce
> > the performance impact. Thanks a lot to Amit for the off-list
> > discussion.
> >
> > The concept of the new idea is that dead tuples are only useful for
> > detecting conflicts when applying *concurrent* transactions from
> > remotes. Any subsequent UPDATE from a remote node after removing the
> > dead tuples should have a later timestamp, meaning it's reasonable to
> > detect an update_missing scenario and convert the UPDATE to an INSERT
> > when applying it.
> >
> > To achieve the above, we can create an additional replication slot on
> > the subscriber side, maintained by the apply worker. This slot is used
> > to retain the dead tuples. The apply worker will advance the slot.xmin
> > after confirming that all the concurrent transactions on the publisher
> > have been applied locally.

Will the replication slot used for this purpose be a physical one or a
logical one? And IIUC such a slot doesn't need to retain WAL, but if we
do that, how do we advance the LSN of the slot?

> > 2) the apply worker sends a new message to the walsender to request
> > the latest WAL flush position (GetFlushRecPtr) on the publisher, and
> > saves it to 'candidate_remote_wal_lsn'. Here we could introduce a new
> > feedback message or extend the existing keepalive message (e.g.,
> > extend the requestReply bit in the keepalive message to add a
> > 'request_wal_position' value).

The apply worker sends a keepalive message when it hasn't received
anything for more than wal_receiver_timeout / 2. So in a very active
system, we cannot rely on piggybacking new information onto the
keepalive messages to get the latest remote flush LSN.

> > 3) The apply worker can continue to apply changes. After applying all
> > the WAL up to 'candidate_remote_wal_lsn', the apply worker can then
> > advance the slot.xmin to 'candidate_xmin'.
> >
> > This approach ensures that dead tuples are not removed until all
> > concurrent transactions have been applied. It can be effective for
> > both bidirectional and non-bidirectional replication cases.
> >
> > We could introduce a boolean subscription option (retain_dead_tuples)
> > to control whether this feature is enabled. Each subscription
> > intending to detect update_deleted conflicts should set
> > retain_dead_tuples to true.
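IIUC, a subscription opting into this behavior would then be created
with something like the following (hypothetical syntax and illustrative
names, since the option is only proposed at this point):

    CREATE SUBSCRIPTION sub_node_b
        CONNECTION 'host=node-b dbname=postgres'
        PUBLICATION pub_node_b
        WITH (retain_dead_tuples = true);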
I'm still studying this idea but let me confirm the following scenario.

Suppose both Node-A and Node-B have the same row (1,1) in table t, and
the XIDs and commit LSNs of T2 and T3 are as follows:

Node A
  T2: DELETE FROM t WHERE id = 1 (10:02 AM)
      XID:100, commit-LSN:1000

Node B
  T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM)
      XID:500, commit-LSN:5000

Further suppose that it's now 10:05 AM, and the latest XID and the
latest flush WAL position of Node-A and Node-B are as follows:

Node A
  current XID: 300
  latest flush LSN: 3000

Node B
  current XID: 700
  latest flush LSN: 7000

Both T2 and T3 are NOT sent to Node-B and Node-A yet, respectively
(i.e., the logical replication is delayed by 5 min).

Consider the following scenario:

1. The apply worker on Node-A calls GetRunningTransactionData() and
gets 301 (set as candidate_xmin).
2. The apply worker on Node-A requests the latest WAL flush position
from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
3. T2 is applied on Node-B, and the latest flush position of Node-B is
now 8000.
4. The apply worker on Node-A continues applying changes, and applies
the transactions up to remote (commit) LSN 7100.
5. Now that the apply worker on Node-A has applied all changes smaller
than candidate_remote_wal_lsn (7000), it increases the slot.xmin to 301
(candidate_xmin).
6. On Node-A, vacuum runs and physically removes the tuple that was
deleted by T2.

Here, on Node-B, there might be a transaction between LSN 7100 and 8000
that requires the tuple deleted by T2. For example, "UPDATE t SET
value = 3 WHERE id = 1" (say T4) is executed on Node-B at LSN 7200, and
it's sent to Node-A after step 6. On Node-A, whether we detect
"update_deleted" or "update_missing" still depends on when vacuum
removes the tuple deleted by T2.

If applying T4 raises an "update_missing" (i.e., the changes are
applied in the order of T2->T3->(vacuum)->T4), it is converted into an
insert, resulting in the table having a row with value = 3.

If applying T4 raises an "update_deleted" (i.e., the changes are
applied in the order of T2->T3->T4->(vacuum)), it is skipped, resulting
in the table having no row.

On the other hand, in this scenario, Node-B applies the changes in the
order of T3->T4->T2, and applying T2 raises a "delete_origin_differ",
resulting in the table having a row with value = 3 (assuming
latest_committs_win is the default resolver for this conflict).

Please confirm this scenario, as I might be missing something.

>
> As each apply worker needs a separate slot to retain deleted rows, the
> requirement for slots will increase. The other possibility is to
> maintain one slot by the launcher or some other central process that
> traverses all subscriptions and remembers the ones marked with
> retain_dead_tuples (let's call this list retain_sub_list). Then, using
> running_transactions, get the oldest running_xact, and then get the
> remote flush location from the other node (publisher node) and store
> those as candidate values (candidate_xmin and
> candidate_remote_wal_lsn) in the slot. We can probably reuse the
> existing candidate variables of the slot. Next, we can check the
> remote_flush locations from all the origins corresponding to
> retain_sub_list, and if all are ahead of candidate_remote_wal_lsn, we
> can update the slot's xmin to candidate_xmin.

Does it mean that we use one candidate_remote_wal_lsn in a slot for all
subscriptions (in retain_sub_list)? IIUC candidate_remote_wal_lsn is an
LSN of one of the publishers, so other publishers could have completely
different LSNs. How do we compare the candidate_remote_wal_lsn to the
remote_flush locations from all the origins?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com