Jaydeep,

Thank you for your excellent efforts on this mission-critical feature. The stated goals of CEP-37 are noble and stand to make valuable improvements for cluster operations. I look forward to testing these new capabilities.

My apologies up front if you’ve already answered these questions. I read the CEP and the linked JIRAs several times, but these are the questions I couldn’t answer myself.

I’m interested in understanding the goals of CEP-37 with respect to rolling upgrades of large clusters, as I am responsible for maintaining the cluster operations runbooks for a number of customers. Today, operators have to navigate the upgrade gauntlet with automated repairs disabled, get all nodes upgraded within gc_grace_seconds, and then run a full repair before restarting automated repairs. I see that CASSANDRA-7530 https://issues.apache.org/jira/browse/CASSANDRA-7530 is related to this.

Is there a goal in this CEP to make automated repair work during rolling upgrades, when multiple versions exist in the cluster? (I think this would imply that stopping automated repairs would no longer be a pre-upgrade step.)

Would automated repair be smart enough to stop automatically if it sees incompatible versions?

Would automated repair continue between nodes with compatible versions, or would it stop for the entire cluster?

If automated repair must be disabled for the entire cluster, will this be a single nodetool command, or must automated repair be disabled on each node individually?

Would it make sense for automated repair to upgrade sstables if it finds old formats? (Maybe this could be a feature that could be optionally enabled?)

With respect to the repair logging tables in the system_distributed keyspace, will these tables have a configurable TTL, or must they be periodically truncated to limit their size?

Thanks,
-Dave

David A. Herrington II
President and Chief Engineer
RhinoSource, Inc.
*Data Lake Architecture, Cloud Computing and Advanced Analytics.*
www.rhinosource.com

On Fri, Mar 7, 2025 at 11:48 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:

> Hello Everyone,
>
> I wanted to update you on CEP-37
> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution>
> (Jira: CASSANDRA-19918 <https://issues.apache.org/jira/browse/CASSANDRA-19918>)
> work.
> Over the last year, some of us (Andy Tolbert, Chris Lohfink, Francisco
> Guerrero, and Kristijonas Zalys) have been working closely on making
> CEP-37 rock solid, with support from Josh McKenzie, Dinesh Joshi, and David
> Capwell.
> First and foremost, a huge thank you to everyone, including the
> broader Apache Cassandra community, for their invaluable contributions in
> making CEP-37 robust and solid!
>
> Here is the current status:
>
> *Feature stability*
>
> - *Voted feature:* All the features mentioned in CEP-37 have worked as
>   expected.
> - *Post-voted feature:* A few new minor improvements
>   <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=272927365#CEP37ApacheCassandraUnifiedRepairSolution-Post-VoteUpdates>
>   have been added to post-voting, and they are also working as expected.
> - Tested the functionality by multiple people over the period of time.
> - Some other facts: it has already been validated at scale
>   <https://www.youtube.com/watch?v=xFicEj6Nhq8>. Another big Cassandra
>   use case is in the process of validating/adopting it in their environment.
>
> *Source Code*
>
> - It is an opt-in feature; nobody notices anything unless someone opts
>   in.
> - By default, this feature is pretty isolated (in a separate package)
>   from the source code point of view (94% of the source code lines are in the
>   new files)
> - A thorough documentation has been added:
>   - overview.doc
>   - metrics.doc
>   - cassandra.yaml doc
>   - NEWS.txt overview
> - Five people (Andy Tolbert, Chris Lohfink, Francisco Guerrero, and
>   Kristijonas Zalys) have contributed.
> - The source code has been reviewed multiple times by the same five
>   people.
>
> *Test Coverage*
>
> - A comprehensive test coverage has been added to cover all aspects.
> - The entire test suite has been passing
>
>
> We are in the final review phase and nearly ready to merge. If anyone has
> any last-minute feedback, this is the final opportunity for review.
>
> Thank you!
> Andy Tolbert, Chris Lohfink, Francisco Guerrero, Kristijonas Zalys, and
> Jaydeep
>