Re: [Discuss] Repair inside C*

Josh McKenzie Fri, 23 Feb 2024 06:30:32 -0800

> we're all willing to bikeshed for our personal preference on where it lives 
> and how it's implemented, and at the end of the day, code talks. I don't 
> think anyone's said they'll die on the hill of implementation details


:D

I don't think we're going to be able to reach a consensus on an email thread 
with higher level abstractions and indicative statements. For instance: "a lot 
of complexity around repair in the main process" vs. "a lot of complexity in 
signaling between a sidecar and a main process and supporting multiple versions 
of C*". Both resonate with me at face value and neither contains enough detail 
to weigh against one another.

A more granular, lower level CEP that includes a tradeoff of the two designs 
with a recommendation on a path forward might help unstick us from the ML 
back-and-forth.

We could also take an indicative vote on "in-process vs. in-sidecar" to see if 
we can get a read on temperature.

On Thu, Feb 22, 2024, at 2:06 PM, Paulo Motta wrote:
> Apologies, I just read the previous message and missed the previous 
> discussion on sidecar vs main process on this thread. :-)
> 
> It does not look like a final agreement was reached about this and there are 
> lots of good arguments for both sides, but perhaps it would be nice to agree 
> on this before a CEP is proposed since this will significantly influence the 
> initial design?
> 
> I tend to agree with Dinesh and Scott's pragmatic stance of providing initial 
> support to repair scheduling on the sidecar, since this has fewer 
> dependencies, and progressively move what makes sense to the main process as 
> TCM/Accord primitives become available and mature.
> 
> On Thu, Feb 22, 2024 at 1:44 PM Paulo Motta <pa...@apache.org> wrote:
>> +1 to Josh's points. The project has considered native repair scheduling 
>> for a long time but it was never made a reality due to the complex 
>> considerations involved and availability of custom implementations/tools 
>> like cassandra-reaper, which is a popular way of scheduling repairs in 
>> Cassandra.
>> 
>> Unfortunately I did not have cycles to review this proposal, but it looks 
>> promising from a quick glance.
>> 
>> One important consideration that I think we need to discuss is: where should 
>> repair scheduling live: in the main process or the sidecar?
>> 
>> I think there is a lot of complexity around repair in the main process and 
>> we need to be extra careful about adding additional complexity on top of 
>> that.
>> 
>> Perhaps this could be a good opportunity to consider the sidecar to host 
>> repair scheduling, since this looks to be a control plane responsibility? 
>> One downside is that this would not make repair scheduling available to 
>> users who do not use the sidecar.
>> 
>> What do you think? It would be great to have input from sidecar maintainers 
>> if this is something that would make sense for that subproject.
>> 
>> On Thu, Feb 22, 2024 at 12:33 PM Josh McKenzie <jmcken...@apache.org> wrote:
>>> Very late response from me here (basically necro'ing this thread).
>>> 
>>> I think it'd be useful to get this condensed into a CEP that we can then 
>>> discuss in that format. It's clearly something we all agree we need and 
>>> having an implementation that works, even if it's not in your preferred 
>>> execution domain, is vastly better than nothing IMO.
>>> 
>>> I don't have cycles (nor background ;) ) to do that, but it sounds like you 
>>> do Jaydeep given the implementation you have on a private fork + design.
>>> 
>>> A non-exhaustive list of things that might be useful incorporating into or 
>>> referencing from a CEP:
>>> Slack thread: https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
>>> Joey's old C* ticket: https://issues.apache.org/jira/browse/CASSANDRA-14346
>>> Even older automatic repair scheduling: 
>>> https://issues.apache.org/jira/browse/CASSANDRA-10070
>>> Your design gdoc: 
>>> https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0
>>> PR with automated repair: 
>>> https://github.com/jaydeepkumar1984/cassandra/commit/ef6456d652c0d07cf29d88dfea03b73704814c2c
>>> 
>>> My intuition is that we're all basically in agreement that this is 
>>> something the DB needs, we're all willing to bikeshed for our personal 
>>> preference on where it lives and how it's implemented, and at the end of 
>>> the day, code talks. I don't think anyone's said they'll die on the hill of 
>>> implementation details, so that feels like CEP time to me.
>>> 
>>> If you were willing and able to get a CEP together for automated repair 
>>> based on the above material, given you've done the work and have the proof 
>>> points it's working at scale, I think this would be a *huge contribution* 
>>> to the community.
>>> 
>>> On Thu, Aug 24, 2023, at 7:26 PM, Jaydeep Chovatia wrote:
>>>> Is anyone going to file an official CEP for this?
>>>> As mentioned in this email thread, here is the design doc for one such 
>>>> solution 
>>>> <https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0>
>>>>  and source code on a private Apache Cassandra patch. Could you go through 
>>>> it and let me know what you think?
>>>> 
>>>> Jaydeep
>>>> 
>>>> On Wed, Aug 2, 2023 at 3:54 PM Jon Haddad <rustyrazorbl...@apache.org> 
>>>> wrote:
>>>>> > That said I would happily support an effort to bring repair scheduling 
>>>>> > to the sidecar immediately. This has nothing blocking it, and would 
>>>>> > potentially enable the sidecar to provide an official repair scheduling 
>>>>> > solution that is compatible with current or even previous versions of 
>>>>> > the database.
>>>>> 
>>>>> This is something I hadn't thought much about, and is a pretty good 
>>>>> argument for using the sidecar initially.  There's a lot of deployments 
>>>>> out there and having an official repair option would be a big win. 
>>>>> 
>>>>> 
>>>>> On 2023/07/26 23:20:07 "C. Scott Andreas" wrote:
>>>>> > I agree that it would be ideal for Cassandra to have a repair scheduler 
>>>>> > in-DB.
>>>>> >
>>>>> > That said I would happily support an effort to bring repair scheduling 
>>>>> > to the sidecar immediately. This has nothing blocking it, and would 
>>>>> > potentially enable the sidecar to provide an official repair scheduling 
>>>>> > solution that is compatible with current or even previous versions of 
>>>>> > the database.
>>>>> >
>>>>> > Once TCM has landed, we’ll have much stronger primitives for repair 
>>>>> > orchestration in the database itself. But I don’t think that should 
>>>>> > block progress on a repair scheduling solution in the sidecar, and 
>>>>> > there is nothing that would prevent someone from continuing to use a 
>>>>> > sidecar-based solution in perpetuity if they preferred.
>>>>> >
>>>>> > - Scott
>>>>> >
>>>>> > > On Jul 26, 2023, at 3:25 PM, Jon Haddad <rustyrazorbl...@apache.org> 
>>>>> > > wrote:
>>>>> > >
>>>>> > > I'm 100% in favor of repair being part of the core DB, not the 
>>>>> > > sidecar.  The current (and past) state of things where running the DB 
>>>>> > > correctly *requires* running a separate process (either community 
>>>>> > > maintained or official C* sidecar) is incredibly painful for folks.  
>>>>> > > The idea that your data integrity needs to be opt-in has never made 
>>>>> > > sense to me from the perspective of either the product or the end 
>>>>> > > user.
>>>>> > >
>>>>> > > I've worked with way too many teams that have either configured this 
>>>>> > > incorrectly or not at all. 
>>>>> > >
>>>>> > > Ideally Cassandra would ship with repair built in and on by default.  
>>>>> > > Power users can disable if they want to continue to maintain their 
>>>>> > > own repair tooling for some reason.
>>>>> > >
>>>>> > > Jon
>>>>> > >
>>>>> > >> On 2023/07/24 20:44:14 German Eichberger via dev wrote:
>>>>> > >> All,
>>>>> > >> We had a brief discussion in [2] about the Uber article [1] where 
>>>>> > >> they talk about having integrated repair into Cassandra and how 
>>>>> > >> great that is. I expressed my disappointment that they didn't work 
>>>>> > >> with the community on that (Uber, if you are listening, time to make 
>>>>> > >> amends 🙂) and it turns out Joey already had the idea and wrote the 
>>>>> > >> code [3] - so I wanted to start a discussion to gauge interest and 
>>>>> > >> maybe how to revive that effort.
>>>>> > >> Thanks,
>>>>> > >> German
>>>>> > >> [1] 
>>>>> > >> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
>>>>> > >> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
>>>>> > >> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
>>>>> >
>>>
