On Tue, Sep 24, 2024 at 12:14 AM Zhijie Hou (Fujitsu) <houzj.f...@fujitsu.com> wrote: > > On Tuesday, September 24, 2024 2:42 PM Masahiko Sawada > <sawada.m...@gmail.com> wrote: > > > > On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu) > > <houzj.f...@fujitsu.com> wrote: > > > > > > On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada > > <sawada.m...@gmail.com> wrote: > > > > I'm still studying this idea but let me confirm the following scenario. > > > > > > > > Suppose both Node-A and Node-B have the same row (1,1) in table t, > > > > and XIDs and commit LSNs of T2 and T3 are the following: > > > > > > > > Node A > > > > T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, > > commit-LSN:1000 > > > > > > > > Node B > > > > T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500, > > > > commit-LSN:5000 > > > > > > > > Further suppose that it's now 10:05 AM, and the latest XID and the > > > > latest flush WAL position of Node-A and Node-B are following: > > > > > > > > Node A > > > > current XID: 300 > > > > latest flush LSN; 3000 > > > > > > > > Node B > > > > current XID: 700 > > > > latest flush LSN: 7000 > > > > > > > > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively > > > > (i.e., the logical replication is delaying for 5 min). > > > > > > > > Consider the following scenario: > > > > > > > > 1. The apply worker on Node-A calls GetRunningTransactionData() and > > > > gets 301 (set as candidate_xmin). > > > > 2. The apply worker on Node-A requests the latest WAL flush position > > > > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn). > > > > 3. T2 is applied on Node-B, and the latest flush position of Node-B is > > > > now > > 8000. > > > > 4. The apply worker on Node-A continues applying changes, and > > > > applies the transactions up to remote (commit) LSN 7100. > > > > 5. Now that the apply worker on Node-A applied all changes smaller > > > > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to > > > > 301 (candidate_xmin). > > > > 6. On Node-A, vacuum runs and physically removes the tuple that was > > > > deleted by T2. > > > > > > > > Here, on Node-B, there might be a transition between LSN 7100 and > > > > 8000 that might require the tuple that is deleted by T2. > > > > > > > > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is > > > > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6. > > > > On Node-A, whether we detect "update_deleted" or "update_missing" > > > > still depends on when vacuum removes the tuple deleted by T2. > > > > > > I think in this case, no matter we detect "update_delete" or > > > "update_missing", the final data is the same. Because T4's commit > > > timestamp should be later than > > > T2 on node A, so in the case of "update_deleted", it will compare the > > > commit timestamp of the deleted tuple's xmax with T4's timestamp, and > > > T4 should win, which means we will convert the update into insert and > > > apply. Even if the deleted tuple is deleted and "update_missing" is > > > detected, the update will still be converted into insert and applied. So, > > > the > > result is the same. > > > > The "latest_timestamp_wins" is the default resolution method for > > "update_deleted"? When I checked the wiki page[1], the "skip" was the > > default > > solution method for that. > > Right, I think the wiki needs some update. > > I think using 'skip' as default for update_delete could easily cause data > divergence when the dead tuple is deleted by an old transaction while the > UPDATE has a newer timestamp like the case you mentioned. It's necessary to > follow the last update win strategy when the incoming update has later > timestamp, which is to convert update to insert.
Right. If "latest_timestamp_wins" is the default resolution for "update_deleted", I think your idea works fine unless I'm missing corner cases. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com