[jira] [Commented] (CASSANDRA-19918) Automated Repair Inside Cassandra

Jaydeepkumar Chovatia (Jira) Sun, 20 Apr 2025 21:34:05 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-19918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946034#comment-17946034
 ]


Jaydeepkumar Chovatia commented on CASSANDRA-19918:
---------------------------------------------------

Rebase on top of the latest trunk has been completed; all the unit tests are 
passing on CircleCI:
- [java17_pre-commit_tests|https://app.circleci.com/pipelines/github/jaydeepkumar1984/cassandra/531/workflows/432995d7-efe5-4f8d-9f7e-1eef1873e612]
- [java17_separate_tests|https://app.circleci.com/pipelines/github/jaydeepkumar1984/cassandra/531/workflows/73a7a1f5-7030-4ed8-9768-48adca81f5f1]
- [java11_pre-commit_tests|https://app.circleci.com/pipelines/github/jaydeepkumar1984/cassandra/531/workflows/98cdee5e-fb46-48a6-b27a-10a36e616f78]
- [java11_separate_tests|https://app.circleci.com/pipelines/github/jaydeepkumar1984/cassandra/531/workflows/d69f7f72-2e38-4b14-b269-361e56a4ba8e]

 

Attached is the complete report, including the dtest results, and it looks clear!

[^trunk_cep_37_post_accord.html]

 

Thanks [~tolbertam] for running the complete test.

> Automated Repair Inside Cassandra
> ---------------------------------
>
>                 Key: CASSANDRA-19918
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19918
>             Project: Apache Cassandra
>          Issue Type: Epic
>          Components: Consistency/Repair
>            Reporter: Jaydeepkumar Chovatia
>            Assignee: Jaydeepkumar Chovatia
>            Priority: Normal
>         Attachments: trunk_cep_37_2025_04_06-1_ci_summary.html, 
> trunk_cep_37_post_accord.html
>
>
> h1. Motivation
> Anti-entropy (Apache Cassandra repair) is essential for every Apache 
> Cassandra cluster to fix data inconsistencies. Frequent data deletions and 
> downed nodes are common causes of data inconsistency. Because many large 
> users have needed a scalable repair solution, a few open-source orchestration 
> tools that trigger repair externally are available. However, multiple custom 
> solutions have led to a lot of confusion in the community. Therefore, repair, 
> like compaction, should be an integral part of Cassandra for it to be a 
> complete solution.
>  
> The proposal is to align on one of the existing solutions and make it part 
> of core Cassandra. Here is the design for one of the solutions:
>  
> Inside Cassandra, there are multiple repairs we would have to schedule:
> 1) Full repair
> 2) Incremental Repair 
> 3) Paxos repair
>  
> The design of the scheduler should be extensible to multiple repair 
> categories with minimal code changes, and all repair types should progress 
> automatically with minimal manual intervention. 
> Migrating to (and rolling back from) incremental 
> repair[[1|https://stackoverflow.com/questions/42182984/how-do-i-enable-incremental-repair-on-cassandra-2-1-13]]
>  has been extremely challenging, especially in a large fleet. One of the 
> design principles is to make it almost touchless from the operator's point 
> of view.
> h1. The Scheduler
> Keeping the above motivation in mind, this design moves repair orchestration 
> inside Cassandra itself, where it repairs the entire ring. 
> At a high level, a dedicated thread pool is assigned to the repair 
> scheduler. The repair scheduler inside Cassandra maintains a new replicated 
> table under the _system_distributed_ keyspace. This table maintains the 
> repair history for all the nodes, such as when each node was last repaired. 
> The scheduler picks the node(s) that run the repair first and continues 
> orchestration to ensure every table and all of its token ranges are 
> repaired. The algorithm can also run repairs simultaneously on multiple 
> nodes, and it splits the token range into subranges, with the necessary 
> retries to handle transient failures. Over time, the automatic repair has 
> become so reliable that it runs as soon as we start a Cassandra cluster, like 
> compaction, and does not require manual intervention. 
> Because this fully automated repair scheduler runs inside Cassandra, there 
> is no dependency on the control plane, significantly reducing our 
> operational overhead.
> *CEP:* 
> [https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution]
> h2. Detailed Design Doc 
> [Automated Repair in 
> Cassandra|https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0]
> h2. PR (on 4.1.6) (Last active: Sep 2024)
> Many folks are currently using 4.1.6 in production. Hence, the following PR 
> on 4.1.6 will make it easier for everybody to review and test the code. If 
> the community decides to merge this CEP, it will land on _trunk_ as opposed 
> to {_}4.1{_}.
> [https://github.com/apache/cassandra/pull/3367/]
> h2. PR (on {_}trunk{_}) (Last active: Sep 2024)
> [https://github.com/apache/cassandra/pull/3598]
>  
> h2. Discussion over Slack
> [[1]|https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619] 
> [[2]|http://cassandra-repair-scheduling-cep37/]
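
The subrange splitting and retry behaviour described under "The Scheduler" above can be illustrated with a minimal sketch. This is not code from the PR or from Cassandra: TokenRange, split, repairAll, and maxAttempts are hypothetical names, tokens are simplified to a non-wrapping range of longs, and the repair itself is stood in for by a caller-supplied predicate.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class SubrangeRepairSketch {
    /** Hypothetical stand-in for a token range (start, end]; assumes start < end (no wrap-around). */
    static final class TokenRange {
        final long start, end;
        TokenRange(long start, long end) { this.start = start; this.end = end; }
    }

    /** Splits a range into at most `parts` contiguous subranges of near-equal width. */
    static List<TokenRange> split(TokenRange range, int parts) {
        if (parts <= 0 || range.end <= range.start)
            throw new IllegalArgumentException("need parts > 0 and start < end");
        long width = Math.subtractExact(range.end, range.start); // throws if the width overflows a long
        long step = Math.max(1, width / parts);
        List<TokenRange> out = new ArrayList<>();
        long cursor = range.start;
        while (cursor < range.end) {
            // The last allowed subrange absorbs any remainder so the whole range is covered.
            long next = (out.size() == parts - 1) ? range.end : Math.min(range.end, cursor + step);
            out.add(new TokenRange(cursor, next));
            cursor = next;
        }
        return out;
    }

    /**
     * Calls repairOnce on each subrange, retrying up to maxAttempts times.
     * Returns the subranges that never succeeded, so a caller could reschedule them.
     */
    static List<TokenRange> repairAll(List<TokenRange> subranges,
                                      Predicate<TokenRange> repairOnce,
                                      int maxAttempts) {
        List<TokenRange> failed = new ArrayList<>();
        for (TokenRange r : subranges) {
            boolean ok = false;
            for (int attempt = 1; attempt <= maxAttempts && !ok; attempt++) {
                ok = repairOnce.test(r);
            }
            if (!ok) failed.add(r);
        }
        return failed;
    }

    public static void main(String[] args) {
        List<TokenRange> subranges = split(new TokenRange(0, 100), 4);
        // Simulated transient failure: the first attempt on each subrange fails, the retry succeeds.
        Set<Long> seen = new HashSet<>();
        List<TokenRange> failed = repairAll(subranges, r -> !seen.add(r.start), 3);
        System.out.println(subranges.size() + " subranges, " + failed.size() + " failed");
    }
}
```

Returning the failed subranges rather than throwing keeps one bad subrange from blocking the rest of the range, which is one plausible reading of "necessary retry to handle transient failures"; the actual design is in the linked CEP and design doc.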



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
