On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu) <houzj.f...@fujitsu.com> wrote:
>
> On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada
> <sawada.m...@gmail.com> wrote:
> >
> > Thank you for considering another idea.
>
> Thanks for reviewing the idea!
>
> > On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
> > >
> > > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > > <houzj.f...@fujitsu.com> wrote:
> > > >
> > > > Apart from the vacuum_defer_cleanup_age idea.
> > >
> > > I think you meant to say the vacuum_committs_age idea.
> > >
> > > > We've given more thought to our approach for retaining dead tuples and
> > > > have come up with another idea that can reliably detect conflicts
> > > > without requiring users to choose a wise value for vacuum_committs_age.
> > > > This new idea could also reduce the performance impact. Thanks a lot
> > > > to Amit for the off-list discussion.
> > > >
> > > > The concept of the new idea is that dead tuples are only useful for
> > > > detecting conflicts when applying *concurrent* transactions from
> > > > remotes. Any subsequent UPDATE from a remote node after the dead
> > > > tuples have been removed should have a later timestamp, meaning it's
> > > > reasonable to detect an update_missing scenario and convert the
> > > > UPDATE to an INSERT when applying it.
> > > >
> > > > To achieve the above, we can create an additional replication slot on
> > > > the subscriber side, maintained by the apply worker. This slot is
> > > > used to retain the dead tuples. The apply worker will advance the
> > > > slot.xmin after confirming that all the concurrent transactions on
> > > > the publisher have been applied locally.
> >
> > Will the replication slot used for this purpose be a physical one or a
> > logical one? And IIUC such a slot doesn't need to retain WAL, but if we
> > do that, how do we advance the LSN of the slot?
>
> I think it would be a logical slot. We can keep the
> restart_lsn/confirmed_flush_lsn as invalid because we don't need to retain
> the WALs for decoding purposes.
>
> > > > 2) The apply worker sends a new message to the walsender to request
> > > > the latest WAL flush position (GetFlushRecPtr) on the publisher, and
> > > > saves it to 'candidate_remote_wal_lsn'. Here we could introduce a new
> > > > feedback message or extend the existing keepalive message (e.g.,
> > > > extend the requestReply bit in the keepalive message to add a
> > > > 'request_wal_position' value).
> >
> > The apply worker sends a keepalive message when it hasn't received
> > anything for more than wal_receiver_timeout / 2. So in a very active
> > system, we cannot rely on piggybacking new information onto the
> > keepalive messages to get the latest remote flush LSN.
>
> Right. I think we need to send this new message at some interval,
> independent of wal_receiver_timeout.
>
> > > > 3) The apply worker can continue to apply changes. After applying all
> > > > the WALs up to 'candidate_remote_wal_lsn', the apply worker can then
> > > > advance the slot.xmin to 'candidate_xmin'.
> > > >
> > > > This approach ensures that dead tuples are not removed until all
> > > > concurrent transactions have been applied. It can be effective for
> > > > both bidirectional and non-bidirectional replication cases.
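To confirm my understanding of steps 2) and 3) above, here is a rough
sketch of the advancement cycle in apply-worker terms. Only
GetRunningTransactionData() and the slot routines are existing code; the
flush-position request/response functions and last_applied_remote_lsn are
hypothetical placeholders for the new message described above:

#include "postgres.h"

#include "access/transam.h"
#include "access/xlogdefs.h"
#include "replication/slot.h"
#include "storage/lwlock.h"
#include "storage/procarray.h"
#include "storage/spin.h"
#include "storage/standby.h"

/* Hypothetical: remote commit LSN of the last transaction applied locally. */
extern XLogRecPtr last_applied_remote_lsn;

/* Hypothetical: the new feedback/keepalive exchange discussed above. */
extern void request_publisher_flush_position(void);
extern bool publisher_flush_position_received(XLogRecPtr *flush_lsn);

static TransactionId candidate_xmin = InvalidTransactionId;
static XLogRecPtr candidate_remote_wal_lsn = InvalidXLogRecPtr;

/* Called periodically from the apply worker's main loop. */
static void
maybe_advance_conflict_detection_slot(void)
{
    if (!TransactionIdIsValid(candidate_xmin))
    {
        /* 1) Take the local nextXid as the candidate xmin. */
        RunningTransactions running = GetRunningTransactionData();

        candidate_xmin = running->nextXid;

        /* GetRunningTransactionData() returns with these locks held. */
        LWLockRelease(ProcArrayLock);
        LWLockRelease(XidGenLock);

        /* 2) Ask the publisher for its current flush position. */
        request_publisher_flush_position();
    }
    else if (XLogRecPtrIsInvalid(candidate_remote_wal_lsn))
    {
        /* Still waiting for the publisher's reply. */
        (void) publisher_flush_position_received(&candidate_remote_wal_lsn);
    }
    else if (last_applied_remote_lsn >= candidate_remote_wal_lsn)
    {
        /*
         * 3) All transactions that were concurrent when the candidate was
         * chosen have been applied locally, so dead tuples older than
         * candidate_xmin are no longer needed to detect update_deleted.
         * Mirror what PhysicalReplicationSlotNewXmin() does for physical
         * slots: update the slot and recompute the global xmin horizon.
         */
        SpinLockAcquire(&MyReplicationSlot->mutex);
        MyReplicationSlot->data.xmin = candidate_xmin;
        MyReplicationSlot->effective_xmin = candidate_xmin;
        SpinLockRelease(&MyReplicationSlot->mutex);

        ReplicationSlotMarkDirty();
        ReplicationSlotSave();
        ReplicationSlotsComputeRequiredXmin(false);

        /* Start the next cycle. */
        candidate_xmin = InvalidTransactionId;
        candidate_remote_wal_lsn = InvalidXLogRecPtr;
    }
}

IIUC one consequence of this cycle is that slot.xmin advances in batches,
so the retention window is bounded by how quickly the worker catches up to
candidate_remote_wal_lsn rather than by a user-chosen age.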
> > > > We could introduce a boolean subscription option (retain_dead_tuples)
> > > > to control whether this feature is enabled. Each subscription
> > > > intending to detect update_deleted conflicts should set
> > > > retain_dead_tuples to true.
> >
> > I'm still studying this idea, but let me confirm the following scenario.
> >
> > Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> > the XIDs and commit LSNs of T2 and T3 are the following:
> >
> > Node A
> >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> >
> > Node B
> >   T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM) XID:500,
> >   commit-LSN:5000
> >
> > Further suppose that it's now 10:05 AM, and the latest XID and the
> > latest flush WAL position of Node-A and Node-B are the following:
> >
> > Node A
> >   current XID: 300
> >   latest flush LSN: 3000
> >
> > Node B
> >   current XID: 700
> >   latest flush LSN: 7000
> >
> > Both T2 and T3 are NOT yet sent to Node-B and Node-A, respectively
> > (i.e., the logical replication is delayed by 5 min).
> >
> > Consider the following scenario:
> >
> > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > gets 301 (set as candidate_xmin).
> > 2. The apply worker on Node-A requests the latest WAL flush position
> > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > 3. T2 is applied on Node-B, and the latest flush position of Node-B is
> > now 8000.
> > 4. The apply worker on Node-A continues applying changes, and applies
> > the transactions up to remote (commit) LSN 7100.
> > 5. Now that the apply worker on Node-A has applied all changes smaller
> > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > 301 (candidate_xmin).
> > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > deleted by T2.
> >
> > Here, on Node-B, there might be a transaction between LSN 7100 and
> > 8000 that requires the tuple that was deleted by T2.
> >
> > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > On Node-A, whether we detect "update_deleted" or "update_missing"
> > still depends on when vacuum removes the tuple deleted by T2.
>
> I think in this case, no matter whether we detect "update_deleted" or
> "update_missing", the final data is the same. Because T4's commit
> timestamp should be later than T2's on Node-A, in the case of
> "update_deleted" we will compare the commit timestamp of the deleted
> tuple's xmax with T4's timestamp, and T4 should win, which means we will
> convert the update into an insert and apply it. Even if the dead tuple
> has already been removed and "update_missing" is detected, the update
> will still be converted into an insert and applied. So, the result is
> the same.
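If I understand the argument, it assumes that the resolution for both
conflict types ends up honoring the later commit timestamp. To spell that
out, here is an illustration only (the enum and function are hypothetical,
not from any patch):

#include "postgres.h"
#include "utils/timestamp.h"

typedef enum ConflictType
{
    CT_UPDATE_DELETED,          /* dead tuple for the key still present */
    CT_UPDATE_MISSING           /* dead tuple already removed by vacuum */
} ConflictType;

/*
 * Whether the remote UPDATE (T4 in the example) ends up being applied as
 * an INSERT, assuming last-timestamp-wins resolution for update_deleted.
 */
static bool
remote_update_applied_as_insert(ConflictType type,
                                TimestampTz remote_commit_ts, /* T4 */
                                TimestampTz local_delete_ts)  /* T2's xmax */
{
    if (type == CT_UPDATE_DELETED)
    {
        /* Compare the deleted tuple's xmax commit timestamp (T2) with
         * the remote commit timestamp (T4); the later one wins. */
        return remote_commit_ts > local_delete_ts;
    }

    /* update_missing: nothing left to compare; convert unconditionally. */
    return true;
}

In the example, T4's commit timestamp is later than T2's, so both branches
return true and the final row is (1,3) either way. But that holds only if
the resolver for update_deleted actually picks the latest timestamp: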
Is "latest_timestamp_wins" the default resolution method for
"update_deleted"? When I checked the wiki page[1], "skip" was the default
resolution method for that.

Regards,

[1] https://wiki.postgresql.org/wiki/Conflict_Detection_and_Resolution#Defaults

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com