[jira] [Comment Edited] (IGNITE-20995) Add more integration tests for tx recovery on unstable topology

Denis Chudov (Jira) Fri, 15 Dec 2023 01:36:20 -0800


    [ 
https://issues.apache.org/jira/browse/IGNITE-20995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17796781#comment-17796781
 ]


Denis Chudov edited comment on IGNITE-20995 at 12/15/23 9:33 AM:
-----------------------------------------------------------------

From my point of view, we should have the following scenarios:
 # Lock conflict between two RW transactions, coordinator is lost for lock 
holder, recovery starts, lock holder is aborted, lock waiter is not affected.
 # Lock conflict between two RW transactions, coordinator is alive, recovery is 
not started, lock waiter is not affected.
 # Resolution of write intent belonging to tx without coordinator, recovery 
happens successfully, write intent is switched.
 # Resolution of write intent belonging to abandoned tx, recovery happens 
successfully, write intent is switched.
 # Resolution of write intent belonging to abandoned tx, commit partition has 
restarted and lost its local volatile tx state map, recovery happens 
successfully, write intent is switched.
 # Resolution of write intent belonging to pending transaction, coordinator is 
alive, recovery is not started.
 # RO transaction tx0 resolves a write intent belonging to the transaction tx1, 
marks it as abandoned and starts the recovery; after that, RW transaction 
tx2 meets the lock belonging to tx1, sees that it was recently marked as 
abandoned and doesn't start the recovery. Recovery is triggered just once.
 # Coordinator is lost, but it has sent the commit message to a commit 
partition; at the same time, the recovery-initiating request is received from 
some data node. Commit is successful, tx recovery is not able to change the 
transaction state, there are no assertions or other errors, write intents on 
the data node are switched.
 # Coordinator is lost, but it has sent the commit message to a commit 
partition; at the same time, the recovery-initiating request is received from 
some data node. Recovery successfully aborts the transaction, there is a correct 
exception on the coordinator, which was not able to change the transaction 
state to committed, there are no assertions or other errors, write intents on 
the data node are switched.
 # Parallel tx recoveries happen on two replicas of the commit partition, both 
processes were started at a moment when the corresponding replica was the 
primary one. This also shouldn't break anything.
 # There are two parallel recoveries on commit partition and a commit process 
initiated by coordinator that is already dead. Both recovery processes get 
correct commit timestamp and resolve write intent correctly.
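
The two "coordinator is lost, but it has sent the commit message" scenarios
above describe a race between a late commit and a recovery-initiated abort.
A minimal sketch of the expected outcome, assuming the commit partition
serializes the final state transition with a compare-and-set. Every name here
(TxFinishRaceSketch, CommitPartition, finish, switchWriteIntent) is
hypothetical and is not part of the Ignite code base; the real tests would go
through the actual cluster and messaging layers.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model only: hypothetical names, not Ignite classes. */
final class TxFinishRaceSketch {
    enum TxState { PENDING, COMMITTED, ABORTED }

    /** Stands in for the tx state kept by the commit partition (assumption). */
    static final class CommitPartition {
        private final AtomicReference<TxState> state =
                new AtomicReference<>(TxState.PENDING);

        /**
         * Tries to move the transaction from PENDING to the requested final
         * state and returns the state that actually won the race.
         */
        TxState finish(TxState requested) {
            if (requested == TxState.PENDING) {
                throw new IllegalArgumentException("final state expected");
            }
            // Only the first caller succeeds; later callers leave the state as is.
            state.compareAndSet(TxState.PENDING, requested);
            return state.get();
        }
    }

    /** Write intent on a data node is switched according to the final state. */
    static String switchWriteIntent(TxState finalState) {
        return finalState == TxState.COMMITTED ? "applied" : "rolled back";
    }

    public static void main(String[] args) {
        // Commit message is processed first: recovery cannot change the state.
        CommitPartition first = new CommitPartition();
        TxState afterCommit = first.finish(TxState.COMMITTED);
        TxState afterRecovery = first.finish(TxState.ABORTED);
        System.out.println(afterCommit + " / " + afterRecovery
                + " / write intent " + switchWriteIntent(afterRecovery));

        // Recovery is processed first: the late commit cannot change the state.
        CommitPartition second = new CommitPartition();
        TxState afterAbort = second.finish(TxState.ABORTED);
        TxState afterLateCommit = second.finish(TxState.COMMITTED);
        System.out.println(afterAbort + " / " + afterLateCommit
                + " / write intent " + switchWriteIntent(afterLateCommit));
    }
}
```

In both orderings the second caller observes the state chosen by the first
one, which is the property the integration tests would assert: no errors, one
final state, and write intents switched according to that state.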


was (Author: denis chudov):
From my point of view, we should have the following scenarios:
 # Lock conflict between two RW transactions, coordinator is lost for lock 
holder, recovery starts, lock holder is aborted, lock waiter is not affected.
 # Lock conflict between two RW transactions, coordinator is alive, recovery is 
not started, lock waiter is not affected.
 # Resolution of write intent belonging to tx without coordinator, recovery 
happens successfully, write intent is switched.
 # Resolution of write intent belonging to abandoned tx, recovery happens 
successfully, write intent is switched.
 # Resolution of write intent belonging to abandoned tx, commit partition has 
restarted and lost its local volatile tx state map, recovery happens 
successfully, write intent is switched.
 # Resolution of write intent belonging to pending transaction, coordinator is 
alive, recovery is not started.
 # RO transaction tx0 resolves a write intent belonging to the transaction tx1, 
marks it as abandoned and starts the recovery; after that, RW transaction 
tx2 meets the lock belonging to tx1, sees that it was recently marked as 
abandoned and doesn't start the recovery. Recovery is triggered just once.
 # Coordinator is lost, but it has sent the commit message to a commit 
partition; at the same time, the recovery-initiating request is received from 
some data node. Commit is successful, tx recovery is not able to change the 
transaction state, there are no assertions or other errors, write intents on 
the data node are switched.
 # Coordinator is lost, but it has sent the commit message to a commit 
partition; at the same time, the recovery-initiating request is received from 
some data node. Recovery successfully aborts the transaction, there is a correct 
exception on the coordinator, which was not able to change the transaction 
state to committed, there are no assertions or other errors, write intents on 
the data node are switched.
 # Parallel tx recoveries happen on two replicas of the commit partition, both 
processes were started at a moment when the corresponding replica was the 
primary one. This also shouldn't break anything.

> Add more integration tests for tx recovery on unstable topology
> ---------------------------------------------------------------
>
>                 Key: IGNITE-20995
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20995
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Alexander Lapin
>            Assignee:  Kirill Sizov
>            Priority: Major
>              Labels: ignite-3
>
> h3. Motivation
> Surprisingly it might be useful to check tx recovery implementation with some 
> tests.
> h3. Definition of Done
> <Specific scenarios will be added a bit later>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
