Kristijonas Zalys created CASSANDRA-20996:
---------------------------------------------

             Summary: Auto-repair scheduler should use LWTs for all auto-repair 
history operations
                 Key: CASSANDRA-20996
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20996
             Project: Apache Cassandra
          Issue Type: Bug
          Components: Consistency/Repair
            Reporter: Kristijonas Zalys
            Assignee: Kristijonas Zalys


When a node in the ring goes down, the auto-repair schedulers of all other 
nodes in the cluster start voting for this node's auto-repair history to be 
removed. Once >50% of the cluster has voted to delete the down node's history, 
it is deleted from the auto-repair system tables.
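
As a rough illustration of the ">50%" rule above, here is a minimal sketch of a strict-majority check. The class and method names are hypothetical and are not taken from the Cassandra code base. It also assumes that the vote count is compared against the total number of nodes in the cluster, which the description does not spell out.

```java
// Hypothetical sketch; names are illustrative and not taken from Cassandra.
final class DeleteVoteSketch {
    private DeleteVoteSketch() {}

    /**
     * Returns true once strictly more than half of the cluster has voted
     * to delete the down node's auto-repair history.
     * Assumption: clusterSize is the node count the votes are measured against.
     */
    static boolean majorityReached(int votes, int clusterSize) {
        if (clusterSize <= 0 || votes < 0 || votes > clusterSize) {
            throw new IllegalArgumentException("invalid vote count");
        }
        // Integer form of (votes / clusterSize > 0.5), avoiding floating point.
        return 2L * votes > clusterSize;
    }
}
```

With four nodes, two votes are not enough under this rule, while three are.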

This cleanup process is important to maintain continuous execution of repair 
across the entire cluster. If a node goes down, it will no longer perform 
repair and will not update its repair history in the system tables. However, as 
it is still present in the auto-repair history table, the down node will still 
be considered as a candidate to run repair. As a result, it will occupy space 
in the auto-repair queue and in cases with low auto-repair parallelism may even 
completely block auto-repair within the cluster.
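
To make the blocking effect concrete, here is a small sketch of a repair queue. It assumes, purely for illustration, that the queue is ordered by oldest last-repair time and that the first `parallelism` entries get the turn. The real queue algorithm may differ, and `RepairQueueSketch` and `hostsWithTurn` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the ordering rule is an assumption, not Cassandra's code.
final class RepairQueueSketch {
    private RepairQueueSketch() {}

    /**
     * Returns the hosts whose turn it is to repair: the `parallelism` hosts
     * with the oldest last-repair timestamp (ties broken by host name).
     */
    static List<String> hostsWithTurn(Map<String, Long> lastRepairMillisByHost, int parallelism) {
        List<String> hosts = new ArrayList<>(lastRepairMillisByHost.keySet());
        hosts.sort(Comparator.comparing((String h) -> lastRepairMillisByHost.get(h))
                             .thenComparing(Comparator.<String>naturalOrder()));
        int slots = Math.max(0, Math.min(parallelism, hosts.size()));
        return new ArrayList<>(hosts.subList(0, slots));
    }
}
```

Under these assumptions, a down node never refreshes its timestamp, so it stays at the head of the queue. With a parallelism of one it holds the only slot, and no live node gets a turn.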

This is exactly what happened on one of our small clusters, where auto-repair 
parallelism was just one node at a time. A node got replaced, but its repair 
history did not get cleaned up, which caused the entire auto-repair system to 
grind to a halt.

Upon investigation we found out that the root cause lies in the ordering of 
operations within the auto-repair scheduler:
 # The scheduler checks when the local node last ran repair.
 # If the time elapsed since then is shorter than the repair interval, it 
short-circuits immediately.
 # Otherwise, it will proceed with computing the auto-repair queue and 
determining if it's the local node's turn to run repair.

Importantly, the auto-repair history cleanup happens inside of the auto-repair 
queue algorithm. This means that a given node will clean up orphaned entries in 
auto-repair history only once its repair interval passes. For example: if you 
use auto-repair parallelism of 1 node and a repair interval of 24 hours, the 
orphaned data will not get cleaned up for up to 24 hours and consequently 
auto-repair may get stuck for up to 24 hours as well.
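
The ordering problem described above can be sketched as follows. This is a simplified model, not the actual scheduler code: `SchedulerOrderSketch`, `tick`, and `computeQueue` are hypothetical names, and the cleanup is reduced to a counter so that the effect of the early return is visible.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the order of operations; not Cassandra's implementation.
final class SchedulerOrderSketch {
    /** Counts how often the cleanup step was reached (illustration only). */
    int cleanupRuns = 0;

    /** One scheduler pass; returns true if the queue computation was reached. */
    boolean tick(Instant now, Instant lastLocalRepair, Duration repairInterval) {
        Duration sinceLastRepair = Duration.between(lastLocalRepair, now);
        if (sinceLastRepair.compareTo(repairInterval) < 0) {
            // Short circuit: the queue algorithm, and the cleanup inside it, is skipped.
            return false;
        }
        computeQueue();
        return true;
    }

    private void computeQueue() {
        // Stand-in for the cleanup of orphaned history, which lives inside
        // the queue algorithm according to the description above.
        cleanupRuns++;
        // The queue computation and the turn check would follow here.
    }
}
```

In this model, with a 24-hour repair interval, a pass made 23 hours after the local node's last repair returns early and performs no cleanup. Only a pass at or after the 24-hour mark reaches the cleanup step.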

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

