Sorry, there is a typo in the CEP-37 link; here is the correct link <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution>
On Thu, Oct 17, 2024 at 4:36 PM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
> First, thank you for your patience while we strengthened the CEP-37.
>
> Over the last eight months, Chris Lohfink, Andy Tolbert, Josh McKenzie,
> Dinesh Joshi, Kristijonas Zalys, and I have done tons of work (online
> discussions/a dedicated Slack channel #cassandra-repair-scheduling-cep37)
> to come up with the best possible design that not only significantly
> simplifies repair operations but also includes the most common features
> that everyone will benefit from running at Scale.
>
> For example,
>
> - Apache Cassandra must be capable of running multiple repair types,
>   such as Full, Incremental, Paxos, and Preview - so the framework should be
>   easily extendable with no additional overhead from the operator’s point of
>   view.
> - An easy way to extend the token-split calculation algorithm with a
>   default implementation should exist.
> - Running incremental repair reliably at Scale is pretty challenging, so
>   we need to place safeguards, such as migration/rollback w/o restart and
>   stopping incremental repair automatically if the disk is about to get full.
>
> We are glad to inform you that CEP-37 (i.e., Repair inside Cassandra) is
> now officially ready for review after multiple rounds of design, testing,
> code reviews, documentation reviews, and, more importantly, validation that
> it runs at Scale!
>
> Some facts about CEP-37.
>
> - Multiple members have verified all aspects of CEP-37 numerous times.
> - The design proposed in CEP-37 has been thoroughly tried and tested on
>   an immense scale (hundreds of unique Cassandra clusters, tens of thousands
>   of Cassandra nodes, with tens of millions of QPS) on top of 4.1 open-source
>   for more than five years; please see more details here
>   <https://www.uber.com/en-US/blog/how-uber-optimized-cassandra-operations-at-scale/>.
> - The following presentation
>   <https://docs.google.com/presentation/d/1Zilww9c7LihHULk_ckErI2s4XbObxjWknKqRtbvHyZc/edit#slide=id.g30a4fd4fcf7_0_13>,
>   which was given during last week’s Apache Cassandra Bay Area Meetup
>   <https://www.meetup.com/apache-cassandra-bay-area/events/303469006/>,
>   highlights the rigor applied to CEP-37.
>
> Since things are massively overhauled, we believe it is almost ready for a
> final pass pre-VOTE. We would like you to please review the CEP-37
> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution)>
> and the associated detailed design doc
> <https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0>.
>
> Thank you everyone!
>
> Chris, Andy, Josh, Dinesh, Kristijonas, and Jaydeep
>
> On Thu, Sep 19, 2024 at 11:26 AM Josh McKenzie <jmcken...@apache.org>
> wrote:
>
>> Not quite; finishing touches on the CEP and design doc are in flight (as
>> of last / this week).
>>
>> Soon(tm).
>>
>> On Thu, Sep 19, 2024, at 2:07 PM, Patrick McFadin wrote:
>>
>> Is this CEP ready for a VOTE thread?
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Unified+Repair+Solution
>>
>> On Sun, Feb 25, 2024 at 12:25 PM Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>> Thanks, Josh. I've just updated the CEP
>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Official+Repair+Solution>
>> and included all the solutions you mentioned below.
>>
>> Jaydeep
>>
>> On Thu, Feb 22, 2024 at 9:33 AM Josh McKenzie <jmcken...@apache.org>
>> wrote:
>>
>> Very late response from me here (basically necro'ing this thread).
>>
>> I think it'd be useful to get this condensed into a CEP that we can then
>> discuss in that format.
>> It's clearly something we all agree we need and
>> having an implementation that works, even if it's not in your preferred
>> execution domain, is vastly better than nothing IMO.
>>
>> I don't have cycles (nor background ;) ) to do that, but it sounds like
>> you do Jaydeep given the implementation you have on a private fork + design.
>>
>> A non-exhaustive list of things that might be useful incorporating into
>> or referencing from a CEP:
>> Slack thread:
>> https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
>> Joey's old C* ticket:
>> https://issues.apache.org/jira/browse/CASSANDRA-14346
>> Even older automatic repair scheduling:
>> https://issues.apache.org/jira/browse/CASSANDRA-10070
>> Your design gdoc:
>> https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0
>> PR with automated repair:
>> https://github.com/jaydeepkumar1984/cassandra/commit/ef6456d652c0d07cf29d88dfea03b73704814c2c
>>
>> My intuition is that we're all basically in agreement that this is
>> something the DB needs, we're all willing to bikeshed for our personal
>> preference on where it lives and how it's implemented, and at the end of
>> the day, code talks. I don't think anyone's said they'll die on the hill of
>> implementation details, so that feels like CEP time to me.
>>
>> If you were willing and able to get a CEP together for automated repair
>> based on the above material, given you've done the work and have the proof
>> points it's working at scale, I think this would be a *huge contribution*
>> to the community.
>>
>> On Thu, Aug 24, 2023, at 7:26 PM, Jaydeep Chovatia wrote:
>>
>> Is anyone going to file an official CEP for this?
>> As mentioned in this email thread, here is one of the solution's design
>> doc
>> <https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0>
>> and source code on a private Apache Cassandra patch.
>> Could you go through
>> it and let me know what you think?
>>
>> Jaydeep
>>
>> On Wed, Aug 2, 2023 at 3:54 PM Jon Haddad <rustyrazorbl...@apache.org>
>> wrote:
>>
>> > That said I would happily support an effort to bring repair scheduling
>> > to the sidecar immediately. This has nothing blocking it, and would
>> > potentially enable the sidecar to provide an official repair scheduling
>> > solution that is compatible with current or even previous versions of the
>> > database.
>>
>> This is something I hadn't thought much about, and is a pretty good
>> argument for using the sidecar initially. There's a lot of deployments out
>> there and having an official repair option would be a big win.
>>
>> On 2023/07/26 23:20:07 "C. Scott Andreas" wrote:
>> > I agree that it would be ideal for Cassandra to have a repair scheduler
>> > in-DB.
>> >
>> > That said I would happily support an effort to bring repair scheduling
>> > to the sidecar immediately. This has nothing blocking it, and would
>> > potentially enable the sidecar to provide an official repair scheduling
>> > solution that is compatible with current or even previous versions of the
>> > database.
>> >
>> > Once TCM has landed, we’ll have much stronger primitives for repair
>> > orchestration in the database itself. But I don’t think that should block
>> > progress on a repair scheduling solution in the sidecar, and there is
>> > nothing that would prevent someone from continuing to use a sidecar-based
>> > solution in perpetuity if they preferred.
>> >
>> > - Scott
>> >
>> > > On Jul 26, 2023, at 3:25 PM, Jon Haddad <rustyrazorbl...@apache.org> wrote:
>> > >
>> > > I'm 100% in favor of repair being part of the core DB, not the
>> > > sidecar. The current (and past) state of things where running the DB
>> > > correctly *requires* running a separate process (either community
>> > > maintained or official C* sidecar) is incredibly painful for folks.
>> > > The idea that your data integrity needs to be opt-in has never made
>> > > sense to me from the perspective of either the product or the end user.
>> > >
>> > > I've worked with way too many teams that have either configured this
>> > > incorrectly or not at all.
>> > >
>> > > Ideally Cassandra would ship with repair built in and on by default.
>> > > Power users can disable if they want to continue to maintain their own
>> > > repair tooling for some reason.
>> > >
>> > > Jon
>> > >
>> > >> On 2023/07/24 20:44:14 German Eichberger via dev wrote:
>> > >> All,
>> > >> We had a brief discussion in [2] about the Uber article [1] where
>> > >> they talk about having integrated repair into Cassandra and how great
>> > >> that is. I expressed my disappointment that they didn't work with the
>> > >> community on that (Uber, if you are listening time to make amends 🙂)
>> > >> and it turns out Joey already had the idea and wrote the code [3] - so
>> > >> I wanted to start a discussion to gauge interest and maybe how to
>> > >> revive that effort.
>> > >> Thanks,
>> > >> German
>> > >> [1] https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
>> > >> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
>> > >> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346