On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu) <houzj.f...@fujitsu.com> wrote:
>
> Hi hackers,
>
> I am starting a new thread to discuss and propose conflict detection for
> update_deleted scenarios during logical replication. This conflict occurs
> when the apply worker cannot find the target tuple to be updated, as the
> tuple might have been removed by another origin.
>
> ---
> BACKGROUND
> ---
>
> Currently, when the apply worker cannot find the target tuple during an
> update, an update_missing conflict is logged. However, to facilitate future
> automatic conflict resolution, it has been agreed[1][2] that we need to
> detect both update_missing and update_deleted conflicts. Specifically, we
> will detect an update_deleted conflict if any dead tuple matching the old
> key value of the update operation is found; otherwise, it will be classified
> as update_missing.
>
> Detecting both update_deleted and update_missing conflicts is important for
> achieving eventual consistency in a bidirectional cluster, because the
> resolution for each conflict type can differ. For example, for an
> update_missing conflict, a feasible resolution might be converting the
> update to an insert and applying it, while for an update_deleted conflict,
> the preferred approach could be to skip the update or to compare the
> timestamp of the delete transaction with the remote update transaction's and
> choose the most recent one. For additional context, please refer to [3],
> which gives examples of how these differences could lead to data divergence.
>
> ---
> ISSUES and SOLUTION
> ---
>
> To detect update_deleted conflicts, we need to search for dead tuples in the
> table. However, dead tuples can be removed by VACUUM at any time. Therefore,
> to ensure consistent and accurate conflict detection, tuples deleted by
> other origins must not be removed by VACUUM before the conflict detection
> process. If the tuples are removed prematurely, it might lead to incorrect
> conflict identification and resolution, causing data divergence between
> nodes.
>
> Here is an example of how VACUUM could affect conflict detection and how to
> prevent this issue. Assume we have a bidirectional cluster with two nodes,
> A and B.
>
> Node A:
>   T1: INSERT INTO t (id, value) VALUES (1,1);
>   T2: DELETE FROM t WHERE id = 1;
>
> Node B:
>   T3: UPDATE t SET value = 2 WHERE id = 1;
>
> The initial idea for retaining the deleted tuples was that once transaction
> T2 had been applied to both nodes, there was no longer a need to preserve
> the dead tuple on Node A. However, a scenario arises where transactions T3
> and T2 occur concurrently, with T3 committing slightly earlier than T2. In
> this case, if Node B applies T2 and Node A removes the dead tuple (1,1) via
> VACUUM, and then Node A applies T3 after the VACUUM operation, the result
> can only be an update_missing conflict. Given that the default resolution
> for update_missing conflicts is apply_or_skip (e.g. convert the update to an
> insert if possible and apply the insert), Node A will eventually hold a row
> (1,2) while Node B becomes empty, causing data inconsistency.
>
> Therefore, the strategy needs to be expanded as follows: Node A cannot
> remove the dead tuple until:
> (a) The DELETE operation is replayed on all remote nodes, *AND*
> (b) The transactions on logical standbys occurring before the replay of
> Node A's DELETE are replayed on Node A as well.
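
[Side note for readers reproducing the example: the scenario above assumes a
bidirectional setup roughly like the sketch below. Object names and connection
strings are made up; the only stock piece is the existing origin = none
subscription option, which keeps each node from re-publishing changes it
received from the other. The proposed patch set builds on top of such a setup.]

-- On Node A (create pub_b on Node B before creating this subscription)
CREATE TABLE t (id int PRIMARY KEY, value int);
CREATE PUBLICATION pub_a FOR TABLE t;
CREATE SUBSCRIPTION sub_b_to_a
    CONNECTION 'host=nodeB dbname=postgres'
    PUBLICATION pub_b
    WITH (origin = none, copy_data = false);

-- On Node B
CREATE TABLE t (id int PRIMARY KEY, value int);
CREATE PUBLICATION pub_b FOR TABLE t;
CREATE SUBSCRIPTION sub_a_to_b
    CONNECTION 'host=nodeA dbname=postgres'
    PUBLICATION pub_a
    WITH (origin = none, copy_data = false);

-- Problematic interleaving from the example:
--   Node A: T1  INSERT INTO t (id, value) VALUES (1,1);
--   Node B: T3  UPDATE t SET value = 2 WHERE id = 1;  -- commits just before T2
--   Node A: T2  DELETE FROM t WHERE id = 1;
-- If VACUUM on Node A removes the dead tuple (1,1) before T3 is applied there,
-- the conflict is reported as update_missing instead of update_deleted.
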
>
> ---
> THE DESIGN
> ---
>
> To achieve the above, we plan to allow the logical walsender to maintain and
> advance the slot.xmin to protect the data in the user table, and to
> introduce a new logical standby feedback message. This message reports the
> WAL position that has been replayed on the logical standby *AND* confirms
> that the changes occurring on the logical standby before that WAL position
> have also been replayed to the walsender's node (where the walsender is
> running). After receiving the new feedback message, the walsender will
> advance the slot.xmin based on the flush info, similar to the advancement of
> catalog_xmin. Currently, the effective_xmin/xmin of a logical slot is unused
> during logical replication, so I think it's safe and won't cause side
> effects to reuse the xmin for this feature.
>
> We have introduced a new subscription option (feedback_slots='slot1,...'),
> where these slots will be used to check condition (b): the transactions on
> logical standbys occurring before the replay of Node A's DELETE are replayed
> on Node A as well. Therefore, on Node B, users should specify the slots
> corresponding to Node A in this option. The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send that LSN as a
> feedback message to the walsender. -- I also thought of making this
> automatic, e.g. letting the apply worker select the slots acquired by the
> walsenders that connect to the same remote server (e.g. if the apply
> worker's connection info or some other flag is the same as the walsender's
> connection info). But it seems tricky, because if some slots are inactive,
> which means the walsenders are not there, the apply worker could not find
> the correct slots to check unless we saved the host along with the slot's
> persisted data.
>
> The new feedback message is sent only if feedback_slots is not NULL. If the
> slots in feedback_slots are removed, a final message containing
> InvalidXLogRecPtr will be sent to inform the walsender to forget about the
> slot.xmin.
>
> To detect update_deleted conflicts during update operations, if the target
> row cannot be found, we perform an additional scan of the table using
> SnapshotAny. This scan aims to locate the most recently deleted row that
> matches the old column values from the remote update operation and has not
> yet been removed by VACUUM. If any such tuple is found, we report the
> update_deleted conflict along with the origin and transaction information
> that deleted the tuple.
>
> Please refer to the attached POC patch set, which implements the above
> design. The patch set is split into several parts to make the initial review
> easier. Please note that the patches are interdependent and cannot work
> independently.
>
> Thanks a lot to Kuroda-san and Amit for the off-list discussion.
>
> Suggestions and comments are highly appreciated!
>
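[To make the feedback_slots part concrete: on the two-node setup sketched
earlier, I read the proposal as follows. On Node B, the subscription that
pulls changes from Node A would name the local slot(s) through which Node B's
own changes are streamed to Node A, so its apply worker can compute the oldest
confirmed flush LSN among them and report it in the new feedback message. The
snippet below is only a guess at how this could look from SQL: the option name
feedback_slots comes from the patch description, but whether it is settable via
ALTER SUBSCRIPTION ... SET, and the slot/subscription names, are assumptions.]

-- On Node B: find the logical slot(s) that serve Node A (by default a
-- subscription's slot name matches the subscription name, so the slot created
-- for Node A's sub_b_to_a would be 'sub_b_to_a').
SELECT slot_name, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';

-- On Node B: tell the subscription that applies Node A's changes which local
-- slot(s) feed Node A, so condition (b) can be checked before Node A advances
-- its slot.xmin and lets VACUUM remove the dead tuples.
ALTER SUBSCRIPTION sub_a_to_b SET (feedback_slots = 'sub_b_to_a');
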
Thank you Hou-san for explaining the design. But to make it easier to
understand, would you be able to explain the sequence/timeline of the *new*
actions performed by the walsender and the apply processes for the given
example, along with the new feedback_slots configuration needed?

Node A: (Procs: walsenderA, applyA)
T1: INSERT INTO t (id, value) VALUES (1,1);   ts=10.00 AM
T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM

Node B: (Procs: walsenderB, applyB)
T3: UPDATE t SET value = 2 WHERE id = 1;      ts=10.01 AM

thanks
Shveta