Re: [Discuss] Repair inside C*

Štefan Miklošovič Fri, 23 Feb 2024 07:00:35 -0800

There are already some community solutions for scheduled repairs, like this
one (1), though it runs alongside the Cassandra node ... anyway. I would like
to see us pick the best of what is already out there and try to integrate it
rather than trying to figure it all out again. That seems like a waste of
time and resources. If there is already something which "works" it would be
cool to spend some time first to get as much value from it as possible.


Just my 2 cents here

(1) https://github.com/Ericsson/ecchronos

On Fri, Feb 23, 2024 at 3:31 PM Josh McKenzie <jmcken...@apache.org> wrote:

> we're all willing to bikeshed for our personal preference on where it
> lives and how it's implemented, and at the end of the day, code talks. I
> don't think anyone's said they'll die on the hill of implementation details
>
>
> :D
>
> I don't think we're going to be able to reach a consensus on an email
> thread with higher level abstractions and indicative statements. For
> instance: "a lot of complexity around repair in the main process" vs. "a
> lot of complexity in signaling between a sidecar and a main process and
> supporting multiple versions of C*". Both resonate with me at face value
> and neither contains enough detail to weigh against one another.
>
> A more granular, lower level CEP that includes a tradeoff of the two
> designs with a recommendation on a path forward might help unstick us from
> the ML back-and-forth.
>
> We could also take an indicative vote on "in-process vs. in-sidecar" to
> see if we can get a read on temperature.
>
> On Thu, Feb 22, 2024, at 2:06 PM, Paulo Motta wrote:
>
> Apologies, I just read the previous message and missed the previous
> discussion on sidecar vs main process on this thread. :-)
>
> It does not look like a final agreement was reached about this and there
> are lots of good arguments for both sides, but perhaps it would be nice to
> agree on this before a CEP is proposed since this will significantly
> influence the initial design?
>
> I tend to agree with Dinesh and Scott's pragmatic stance of providing
> initial support to repair scheduling on the sidecar, since this has fewer
> dependencies, and progressively move what makes sense to the main process
> as TCM/Accord primitives become available and mature.
>
> On Thu, Feb 22, 2024 at 1:44 PM Paulo Motta <pa...@apache.org> wrote:
>
> +1 to Josh's points. The project has considered native repair scheduling
> for a long time but it was never made a reality due to the complex
> considerations involved and availability of custom implementations/tools
> like cassandra-reaper, which is a popular way of scheduling repairs in
> Cassandra.
>
> Unfortunately I did not have cycles to review this proposal, but it looks
> promising from a quick glance.
>
> One important consideration that I think we need to discuss is: where
> should repair scheduling live: in the main process or the sidecar?
>
> I think there is a lot of complexity around repair in the main process and
> we need to be extra careful about adding additional complexity on top of
> that.
>
> Perhaps this could be a good opportunity to consider the sidecar to host
> repair scheduling, since this looks to be a control plane responsibility?
> One downside is that this would not make repair scheduling available to
> users who do not use the sidecar.
>
> What do you think? It would be great to have input from sidecar
> maintainers if this is something that would make sense for that subproject.
>
> On Thu, Feb 22, 2024 at 12:33 PM Josh McKenzie <jmcken...@apache.org>
> wrote:
>
>
> Very late response from me here (basically necro'ing this thread).
>
> I think it'd be useful to get this condensed into a CEP that we can then
> discuss in that format. It's clearly something we all agree we need and
> having an implementation that works, even if it's not in your preferred
> execution domain, is vastly better than nothing IMO.
>
> I don't have cycles (nor background ;) ) to do that, but it sounds like
> you do, Jaydeep, given the implementation you have on a private fork + design.
>
> A non-exhaustive list of things that might be useful to incorporate into or
> reference from a CEP:
> Slack thread:
> https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> Joey's old C* ticket:
> https://issues.apache.org/jira/browse/CASSANDRA-14346
> Even older automatic repair scheduling:
> https://issues.apache.org/jira/browse/CASSANDRA-10070
> Your design gdoc:
> https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0
> PR with automated repair:
> https://github.com/jaydeepkumar1984/cassandra/commit/ef6456d652c0d07cf29d88dfea03b73704814c2c
>
> My intuition is that we're all basically in agreement that this is
> something the DB needs, we're all willing to bikeshed for our personal
> preference on where it lives and how it's implemented, and at the end of
> the day, code talks. I don't think anyone's said they'll die on the hill of
> implementation details, so that feels like CEP time to me.
>
> If you were willing and able to get a CEP together for automated repair
> based on the above material, given you've done the work and have the proof
> points it's working at scale, I think this would be a *huge contribution*
> to the community.
>
> On Thu, Aug 24, 2023, at 7:26 PM, Jaydeep Chovatia wrote:
>
> Is anyone going to file an official CEP for this?
> As mentioned in this email thread, here is one solution's design
> doc
> <https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0>
> and source code on a private Apache Cassandra patch. Could you go through
> it and let me know what you think?
>
> Jaydeep
>
> On Wed, Aug 2, 2023 at 3:54 PM Jon Haddad <rustyrazorbl...@apache.org>
> wrote:
>
> > That said I would happily support an effort to bring repair scheduling
> to the sidecar immediately. This has nothing blocking it, and would
> potentially enable the sidecar to provide an official repair scheduling
> solution that is compatible with current or even previous versions of the
> database.
>
> This is something I hadn't thought much about, and is a pretty good
> argument for using the sidecar initially.  There's a lot of deployments out
> there and having an official repair option would be a big win.
>
>
> On 2023/07/26 23:20:07 "C. Scott Andreas" wrote:
> > I agree that it would be ideal for Cassandra to have a repair scheduler
> in-DB.
> >
> > That said I would happily support an effort to bring repair scheduling
> to the sidecar immediately. This has nothing blocking it, and would
> potentially enable the sidecar to provide an official repair scheduling
> solution that is compatible with current or even previous versions of the
> database.
> >
> > Once TCM has landed, we’ll have much stronger primitives for repair
> orchestration in the database itself. But I don’t think that should block
> progress on a repair scheduling solution in the sidecar, and there is
> nothing that would prevent someone from continuing to use a sidecar-based
> solution in perpetuity if they preferred.
> >
> > - Scott
> >
> > > On Jul 26, 2023, at 3:25 PM, Jon Haddad <rustyrazorbl...@apache.org>
> wrote:
> > >
> > > I'm 100% in favor of repair being part of the core DB, not the
> sidecar.  The current (and past) state of things where running the DB
> correctly *requires* running a separate process (either community
> maintained or official C* sidecar) is incredibly painful for folks.  The
> idea that your data integrity needs to be opt-in has never made sense to me
> from the perspective of either the product or the end user.
> > >
> > > I've worked with way too many teams that have either configured this
> incorrectly or not at all.
> > >
> > > Ideally Cassandra would ship with repair built in and on by default.
> Power users can disable if they want to continue to maintain their own
> repair tooling for some reason.
> > >
> > > Jon
> > >
> > >> On 2023/07/24 20:44:14 German Eichberger via dev wrote:
> > >> All,
> > >> We had a brief discussion in [2] about the Uber article [1] where
> they talk about having integrated repair into Cassandra and how great that
> is. I expressed my disappointment that they didn't work with the community
on that (Uber, if you are listening, time to make amends 🙂) and it turns
> out Joey already had the idea and wrote the code [3] - so I wanted to start
> a discussion to gauge interest and maybe how to revive that effort.
> > >> Thanks,
> > >> German
> > >> [1]
> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
> > >> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> > >> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
> >
>
>
>
>
