Hi David,

Thanks for the kind words!

>Is there a goal in this CEP to make automated repair work during rolling
upgrades, when multiple versions exist in the cluster?
We debated this at length on ASF Slack
(#cassandra-repair-scheduling-cep37). The summary is that, ideally, we want
repair to work in a mixed-version cluster, but in reality there is currently
no test suite in Apache Cassandra that verifies streaming behavior across
mixed versions, so confidence is low.
We agreed on the following plan: 1) keeping safety in mind, disable repair
by default in a mixed-version cluster; 2) add a comprehensive test suite;
3) allow repair in a mixed-version cluster. Currently, we are at step 1.

>Would automated repair be smart enough to automatically stop, if it sees
incompatible versions?
That's the plan, and we already have a PR (CASSANDRA-20048
<https://issues.apache.org/jira/browse/CASSANDRA-20048>) out from Chris
Lohfink. What we are still debating is whether to stop only on a major
version mismatch or also on a minor version mismatch; we are leaning
towards stopping only on a major version mismatch. Regardless, this
should be available soon.
We are also extending this further, per feedback from David Capwell, so that
repair stops automatically if we detect a new DC or a change to a
keyspace's RF. That will be covered later as part of CASSANDRA-20414
<https://issues.apache.org/jira/browse/CASSANDRA-20414>.
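
To make the major-versus-minor question concrete, here is a rough Python
sketch of the two policies. It is only an illustration and not the code in
the CASSANDRA-20048 patch; the function names, the `major_only` flag, and
the assumption that every node reports a release string of the form
"major.minor[.patch][-suffix]" are all hypothetical.

```python
from typing import Iterable, Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a string such as '5.0.3' or '4.1.8-SNAPSHOT'.

    Hypothetical helper: assumes at least 'major.minor' is present.
    """
    core = version.split("-", 1)[0]  # drop any '-SNAPSHOT' style suffix
    parts = core.split(".")
    return int(parts[0]), int(parts[1])


def should_pause_repair(node_versions: Iterable[str], major_only: bool = True) -> bool:
    """Return True if the cluster counts as mixed-version under the chosen policy.

    major_only=True  -> pause only when major versions differ.
    major_only=False -> pause when major or minor versions differ.
    Patch-level differences are ignored under both policies.
    """
    parsed = {parse_version(v) for v in node_versions}
    if major_only:
        return len({major for major, _ in parsed}) > 1
    return len(parsed) > 1
```

With the policy we are leaning towards (`major_only=True`), a cluster
running 5.0.2 and 5.1.0 would keep repairing, while a cluster running 4.1.8
and 5.0.3 would pause.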

>If automated repair must be disabled for the entire cluster, will this be
a single nodetool command, or must automated repair be disabled on each
node individually?
Yes, it is a nodetool command and does not require any restarts! All the
*nodetool* command details are currently covered in the design doc
<https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit?tab=t.0#heading=h.89fmsespiosd>,
and the same details will also be available in the Cassandra overview.adoc
<https://github.com/apache/cassandra/pull/3598/files?short_path=e901018#diff-e90101885c1188844bb4188d1301277bfdc4a9e1e705c4ab8a6cc5a4b44460c0>.

>Would it make sense for automated repair to upgrade sstables, if it finds
old formats? (Maybe this could be a feature that could be optionally
enabled?)
My opinion is that it should not be part of the repair. It is best suited
as part of the Cassandra upgrade framework; I guess Paulo M is looking at
it.

>W.R.T. the repair logging tables in the system_distributed keyspace, will
these tables have a configurable TTL, or must they be periodically
truncated to limit their size?
The number of entries will equal the number of Cassandra nodes in a
cluster. There is no TTL because each row represents the repair status of
that particular node. The entries would be automatically added/removed as
nodes are added/removed from the Cassandra cluster.
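
As a toy illustration of why the size stays bounded, here is a small Python
sketch that uses an in-memory dict in place of the real table. The
`RepairStatus` fields and the `sync_status_rows` function are hypothetical
and do not reflect the actual schema or code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class RepairStatus:
    # Hypothetical field; the real table's columns may differ.
    last_finished: str = "never"


def sync_status_rows(rows: Dict[str, RepairStatus], live_nodes: Iterable[str]) -> None:
    """Keep exactly one status row per node currently in the cluster."""
    live = set(live_nodes)
    for node in live - set(rows):
        rows[node] = RepairStatus()  # node joined: add its row
    for node in set(rows) - live:
        del rows[node]  # node left: remove its row
```

However many times nodes join or leave, the number of rows never exceeds
the number of nodes, so there is nothing to expire or truncate.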

Jaydeep

On Sat, Mar 8, 2025 at 7:46 AM Dave Herrington <he...@rhinosource.com>
wrote:

> Jaydeep,
>
> Thank you for your excellent efforts on this mission-critical feature.
> The stated goals of CEP-37 are noble and stand to make valuable
> improvements for cluster operations.  I look forward to testing these new
> capabilities.
>
> My apologies up-front if you’ve already answered these questions.  I did
> read the CEP a number of times and the linked JIRAs, but these are my
> questions that I couldn’t answer myself.
>
> I’m interested to understand the goals of CEP-37 W.R.T. rolling
> upgrades of large clusters, as I am responsible for maintaining the cluster
> operations runbooks for a number of customers.
>
> Operators have to navigate the upgrade gauntlet with automated repairs
> disabled and get all nodes upgraded within gc_grace_seconds and then do a
> full repair, before restarting automated repairs.
>
> I see that CASSANDRA-7530
> https://issues.apache.org/jira/browse/CASSANDRA-7530 is related to this.
>
> Is there a goal in this CEP to make automated repair work during rolling
> upgrades, when multiple versions exist in the cluster?
>
> (I think this would imply that stopping automated repairs would no longer
> be a pre-upgrade step.)
>
> Would automated repair be smart enough to automatically stop, if it sees
> incompatible versions?
>
> Would automated repair continue between nodes with compatible versions, or
> would it stop for the entire cluster?
>
> If automated repair must be disabled for the entire cluster, will this be
> a single nodetool command, or must automated repair be disabled on each
> node individually?
>
> Would it make sense for automated repair to upgrade sstables, if it finds
> old formats? (Maybe this could be a feature that could be optionally
> enabled?)
>
> W.R.T. the repair logging tables in the system_distributed keyspace, will
> these tables have a configurable TTL, or must they be periodically
> truncated to limit their size?
>
> Thanks,
> -Dave
>
> David A. Herrington II
> President and Chief Engineer
> RhinoSource, Inc.
>
> *Data Lake Architecture, Cloud Computing and Advanced Analytics.*
>
> www.rhinosource.com
>
>
> On Fri, Mar 7, 2025 at 11:48 AM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Hello Everyone,
>>
>> I wanted to update you on CEP-37
>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution>
>>  (Jira:
>> CASSANDRA-19918 <https://issues.apache.org/jira/browse/CASSANDRA-19918>)
>> work.
>> Over the last year, some of us (Andy Tolbert, Chris Lohfink, Francisco
>> Guerrero, and Kristijonas Zalys) have been working closely on making
>> CEP-37 rock solid, with support from Josh McKenzie, Dinesh Joshi, and David
>> Capwell.
>> First and foremost, a huge thank you to everyone, including the
>> broader Apache Cassandra community, for their invaluable contributions in
>> making CEP-37 robust and solid!
>>
>> Here is the current status:
>>
>> *Feature stability*
>>
>>    - *Voted feature:* All the features mentioned in CEP-37 have worked
>>    as expected.
>>    - *Post-voted feature:* A few new minor improvements
>>    <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=272927365#CEP37ApacheCassandraUnifiedRepairSolution-Post-VoteUpdates>
>>    have been added since the vote, and they are also working as expected.
>>    - Multiple people have tested the functionality over an extended period.
>>    - Some other facts: it has already been validated at scale
>>    <https://www.youtube.com/watch?v=xFicEj6Nhq8>. Another big Cassandra
>>    use case is in the process of validating/adopting it in their environment.
>>
>> *Source Code*
>>
>>    - It is an opt-in feature; nobody notices anything unless someone
>>    opts in.
>>    - By default, this feature is pretty isolated (in a separate package)
>>    from the source code point of view (94% of the source code lines are in
>>    the new files).
>>    - Thorough documentation has been added:
>>       - overview.doc
>>       - metrics.doc
>>       - cassandra.yaml doc
>>       - NEWS.txt overview
>>    - Five people (Andy Tolbert, Chris Lohfink, Francisco Guerrero,
>>    Kristijonas Zalys, and Jaydeep) have contributed.
>>    - The source code has been reviewed multiple times by the same five
>>    people.
>>
>> *Test Coverage*
>>
>>    - Comprehensive test coverage has been added to cover all aspects.
>>    - The entire test suite has been passing.
>>
>>
>> We are in the final review phase and nearly ready to merge. If anyone has
>> any last-minute feedback, this is the final opportunity for review.
>>
>> Thank you!
>> Andy Tolbert, Chris Lohfink, Francisco Guerrero, Kristijonas Zalys, and
>> Jaydeep
>>
>