[ https://issues.apache.org/jira/browse/IGNITE-20995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17796781#comment-17796781 ]
Denis Chudov edited comment on IGNITE-20995 at 12/15/23 9:33 AM:
-----------------------------------------------------------------

From my point of view, we should have the following scenarios:
# Lock conflict between two RW transactions, coordinator is lost for lock holder, recovery starts, lock holder is aborted, lock waiter is not affected.
# Lock conflict between two RW transactions, coordinator is alive, recovery is not started, lock waiter is not affected.
# Resolution of write intent belonging to tx without coordinator, recovery happens successfully, write intent is switched.
# Resolution of write intent belonging to abandoned tx, recovery happens successfully, write intent is switched.
# Resolution of write intent belonging to abandoned tx, commit partition has restarted and lost its local volatile tx state map, recovery happens successfully, write intent is switched.
# Resolution of write intent belonging to pending transaction, coordinator is alive, recovery is not started.
# RO transaction tx0 resolves write intent belonging to the transaction tx1, marks tx1 as abandoned and starts the recovery; after that, RW transaction tx2 meets the lock belonging to tx1, sees that tx1 was marked as abandoned recently and doesn't start the recovery. Recovery is triggered just once.
# Coordinator is lost, but it has sent the commit message to a commit partition; at the same time the recovery initiating request is received from some data node. Commit is successful, tx recovery was not able to change transaction state, there are no assertions or other errors, write intents on data node are switched.
# Coordinator is lost, but it has sent the commit message to a commit partition; at the same time the recovery initiating request is received from some data node. Recovery successfully aborts the transaction, there is a correct exception on the coordinator, it was not able to change transaction state to commit, there are no assertions or other errors, write intents on data node are switched.
# Parallel tx recoveries happen on two replicas of commit partition, both processes were started at a moment when the corresponding replica was the primary one. This also shouldn't break anything.
# There are two parallel recoveries on commit partition and a commit process initiated by coordinator that is already dead. Both recovery processes get correct commit timestamp and resolve write intent correctly.

was (Author: denis chudov):
From my point of view, we should have the following scenarios:
# Lock conflict between two RW transactions, coordinator is lost for lock holder, recovery starts, lock holder is aborted, lock waiter is not affected.
# Lock conflict between two RW transactions, coordinator is alive, recovery is not started, lock waiter is not affected.
# Resolution of write intent belonging to tx without coordinator, recovery happens successfully, write intent is switched.
# Resolution of write intent belonging to abandoned tx, recovery happens successfully, write intent is switched.
# Resolution of write intent belonging to abandoned tx, commit partition has restarted and lost its local volatile tx state map, recovery happens successfully, write intent is switched.
# Resolution of write intent belonging to pending transaction, coordinator is alive, recovery is not started.
# RO transaction tx0 resolves write intent belonging to the transaction tx1, marks tx1 as abandoned and starts the recovery; after that, RW transaction tx2 meets the lock belonging to tx1, sees that tx1 was marked as abandoned recently and doesn't start the recovery. Recovery is triggered just once.
# Coordinator is lost, but it has sent the commit message to a commit partition; at the same time the recovery initiating request is received from some data node. Commit is successful, tx recovery was not able to change transaction state, there are no assertions or other errors, write intents on data node are switched.
# Coordinator is lost, but it has sent the commit message to a commit partition; at the same time the recovery initiating request is received from some data node. Recovery successfully aborts the transaction, there is a correct exception on the coordinator, it was not able to change transaction state to commit, there are no assertions or other errors, write intents on data node are switched.
# Parallel tx recoveries happen on two replicas of commit partition, both processes were started at a moment when the corresponding replica was the primary one. This also shouldn't break anything.

> Add more integration tests for tx recovery on unstable topology
> ---------------------------------------------------------------
>
>                 Key: IGNITE-20995
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20995
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Alexander Lapin
>            Assignee: Kirill Sizov
>            Priority: Major
>              Labels: ignite-3
>
> h3. Motivation
> Surprisingly it might be useful to check tx recovery implementation with some
> tests.
> h3. Definition of Done
> <Specific scenarios will be added a bit later>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
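
For the two "coordinator is lost, but it has sent the commit message" scenarios in the comment, the expected behaviour is that exactly one of the competing requests (commit from the coordinator, abort from recovery) decides the final state, and the loser observes that state without errors. A minimal sketch of that invariant follows. All names in it (TxStateRaceSketch, TxStateCell, finish) are hypothetical and are not Ignite APIs; the real commit partition logic may be organised quite differently.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical illustration only; none of these types exist in Ignite. */
final class TxStateRaceSketch {
    enum TxState { PENDING, COMMITTED, ABORTED }

    /** Holds the state of one transaction, as a commit partition might (assumption). */
    static final class TxStateCell {
        private final AtomicReference<TxState> state = new AtomicReference<>(TxState.PENDING);

        /**
         * Tries to move the transaction from PENDING to the requested final state.
         * Returns the final state that actually won, which may differ from the request.
         */
        TxState finish(TxState target) {
            if (target == TxState.PENDING) {
                throw new IllegalArgumentException("target must be a final state");
            }
            // Only one caller can succeed here; the state never goes back to PENDING,
            // so a losing caller always reads the winner's final state.
            return state.compareAndSet(TxState.PENDING, target) ? target : state.get();
        }

        TxState current() {
            return state.get();
        }
    }

    public static void main(String[] args) {
        // Commit message arrives first: recovery cannot change the state.
        TxStateCell commitFirst = new TxStateCell();
        System.out.println(commitFirst.finish(TxState.COMMITTED));
        System.out.println(commitFirst.finish(TxState.ABORTED));

        // Recovery arrives first: the late commit sees ABORTED and can report it.
        TxStateCell recoveryFirst = new TxStateCell();
        System.out.println(recoveryFirst.finish(TxState.ABORTED));
        System.out.println(recoveryFirst.finish(TxState.COMMITTED));
    }
}
```

An integration test for either scenario could then assert two things: the state observed by the losing request equals the state chosen by the winner, and the write intents on the data node are switched according to that same state.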