[jira] [Updated] (CASSANDRA-20996) Auto-repair scheduler should use LWTs for all auto-repair history operations

Kristijonas Zalys (Jira) Thu, 30 Oct 2025 22:04:07 -0700


     [ 
https://issues.apache.org/jira/browse/CASSANDRA-20996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kristijonas Zalys updated CASSANDRA-20996:
------------------------------------------
    Description: The a  (was: When a node in the ring goes down, the 
auto-repair schedulers of all other nodes in the cluster start voting for this 
node's auto-repair history to be removed. Once >50% of the cluster votes to 
delete the down node, its auto-repair history is deleted from the auto-repair 
system tables.
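
The vote described above can be sketched as a simple majority tally. This is 
only an illustration, not the scheduler's actual code: the class DeleteVotes, 
its method names, and the choice to count the full cluster size (including the 
down node) as the denominator are all assumptions.

```python
from collections import defaultdict


class DeleteVotes:
    """Hypothetical tally of delete votes; not Cassandra's real schema or API."""

    def __init__(self, cluster_size):
        # Assumption: the denominator is the total node count of the cluster.
        self.cluster_size = cluster_size
        self._votes = defaultdict(set)  # down node id -> ids of nodes that voted

    def vote(self, voter, down_node):
        # A set makes repeated votes from the same node idempotent.
        self._votes[down_node].add(voter)

    def should_delete(self, down_node):
        # Strictly more than 50% of the cluster must have voted for deletion.
        return 2 * len(self._votes[down_node]) > self.cluster_size
```

In a 5-node cluster, two votes are not enough under this sketch and the third 
vote tips the decision; in a 4-node cluster, two votes are exactly 50% and 
still not enough.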

This cleanup process is important to maintain continuous execution of repair 
across the entire cluster. If a node goes down, it will no longer perform 
repair and will not update its repair history in the system tables. However, as 
it is still present in the auto-repair history table, the down node will still 
be considered as a candidate to run repair. As a result, it will occupy space 
in the auto-repair queue and in cases with low auto-repair parallelism may even 
completely block auto-repair within the cluster.

This is exactly what happened on one of our small clusters, where auto-repair 
parallelism was just one node at a time. A node was replaced but its repair 
history was not cleaned up, which caused the entire auto-repair system to 
grind to a halt.

Upon investigation, we found that the root cause lies in the ordering of 
operations within the auto-repair scheduler:
 # The scheduler checks when the local node last ran repair.
 # If the time elapsed since then is shorter than the repair interval, it 
short-circuits immediately.
 # Otherwise, it proceeds to compute the auto-repair queue and determine 
whether it is the local node's turn to run repair.

Importantly, the auto-repair history cleanup happens inside of the auto-repair 
queue algorithm. This means that a given node will clean up orphaned entries in 
auto-repair history only once its repair interval passes. For example: if you 
use auto-repair parallelism of 1 node and a repair interval of 24 hours, the 
orphaned data will not get cleaned up for up to 24 hours and consequently 
auto-repair may get stuck for up to 24 hours as well.
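
The ordering problem described above can be reproduced in a few lines. This is 
a simplified sketch, not the real scheduler: SchedulerSketch, run_once, and 
compute_queue are hypothetical names, and the cleanup is reduced to dropping 
history entries for nodes that are no longer live (the real cleanup goes 
through the vote described earlier).

```python
from dataclasses import dataclass, field


@dataclass
class SchedulerSketch:
    """Hypothetical model of one node's auto-repair scheduler."""

    repair_interval_s: float
    last_repair_s: float
    history: set = field(default_factory=set)     # node ids with repair history
    live_nodes: set = field(default_factory=set)  # node ids currently in the ring

    def compute_queue(self):
        # The cleanup lives inside the queue computation, so it only runs
        # when the queue is actually computed.
        self.history &= self.live_nodes
        return sorted(self.history)

    def run_once(self, now_s):
        if now_s - self.last_repair_s < self.repair_interval_s:
            # Short circuit: compute_queue() and its cleanup are skipped.
            return None
        return self.compute_queue()
```

With a repair interval of 24 hours, calling run_once one hour after the last 
repair returns without touching the orphaned entry; the entry is only removed 
once the full interval has elapsed.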

 )

> Auto-repair scheduler should use LWTs for all auto-repair history operations
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20996
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20996
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Kristijonas Zalys
>            Assignee: Kristijonas Zalys
>            Priority: Normal
>
> The a



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
