Implementation details aside, I'm firmly in the "it would be nice if C* could take care of it" camp. Reaper is pretty damn easy to use and people *still* don't put it in prod.
> On Apr 4, 2018, at 4:16 AM, Rahul Singh <rahul.xavier.si...@gmail.com> wrote:
>
> I understand the merits of both approaches. In working with other DBs in the "old country" of SQL, we often had to write indexing sequences manually for important tables. It was "built into the product", but in order to leverage the maximum benefit of indices we had to have different indices other than the clustered (physical) index. The process still sucked. It's never perfect.
>
> The JVM is already fraught with GC issues, and having another process managed in the same heap space is what I'm worried about. Technically the process could be in the same binary but started as a sidecar or in the same main process.
>
> Consider a process called "cassandra-agent" that sits around with a scheduler based on config or a Cassandra table, distributed in the same release. Shell / service scripts would start it; the end user would know it only by examining the .sh files. This opens the possibility of including a GUI hosted in the same process without cluttering the core coolness of Cassandra.
>
> Best,
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Apr 4, 2018, 2:50 AM -0400, Dor Laor <d...@scylladb.com>, wrote:
>> We at Scylla implemented repair in a similar way to the Cassandra Reaper. We do that using an external application, written in Go, that manages repair for multiple clusters and saves the data in an external Scylla cluster. The logic resembles the Reaper one, with some specific internal sharding optimizations, and uses the Scylla REST API.
>>
>> However, I have doubts it's the ideal way. After playing a bit with CockroachDB, I realized it's super nice to have a single binary that repairs itself, provides a GUI and is the core DB.
>>
>> Even while distributed, you can elect a leader node to manage the repair in a consistent way, so the complexity can be reduced to a minimum. Repair can write its status to the system tables and provide an API for progress, rate control, etc.
>>
>> The big advantage of having repair embedded in the core is that there is no need to expose internal state to the repair logic, so an external program doesn't need to deal with different versions of Cassandra, different repair capabilities of the core (such as incremental on/off) and so forth. A good database should schedule its own repair; it knows whether the hinted handoff threshold was crossed, it knows whether nodes were replaced, etc.
>>
>> My 2 cents. Dor
>>
>> On Tue, Apr 3, 2018 at 11:13 PM, Dinesh Joshi <dinesh.jo...@yahoo.com.invalid> wrote:
>>
>>> Simon,
>>>
>>> You could still do load-aware repair outside of the main process by reading Cassandra's metrics.
>>>
>>> In general, I don't think the maintenance tasks necessarily need to live in the main process, where they could negatively impact the read / write path. Unless strictly required by the serving path, they could live in a sidecar process. There are multiple benefits, including isolation, faster iteration and loose coupling. For example, this would mean that the maintenance tasks can have a different GC profile than the main process and it would be OK. Today that is not the case.
>>>
>>> The only issue I see is that the project does not provide an official sidecar. Perhaps there should be one; we probably wouldn't have had to have this discussion ;)
>>>
>>> Dinesh
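To make the sidecar idea concrete, here is a rough, untested sketch of what a "cassandra-agent" cycle could look like: read the compaction metrics over JMX as a load gate, then kick off repair through the StorageService MBean (the same repairAsync entry point nodetool uses in recent versions). The JMX port, keyspace name, and repair options are illustrative assumptions, not a worked-out design.

    import java.util.HashMap;
    import java.util.Map;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class RepairAgent {
        public static void main(String[] args) throws Exception {
            // Assumes the default local JMX endpoint (port 7199).
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();

                // Load gate: skip this cycle if compactions are pending.
                ObjectName pending = new ObjectName(
                    "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks");
                int pendingTasks = (Integer) mbs.getAttribute(pending, "Value");
                if (pendingTasks > 0) {
                    System.out.println("Compactions pending, deferring repair");
                    return;
                }

                // Start an asynchronous repair of one keyspace ("my_keyspace"
                // is a placeholder) via the StorageService MBean.
                ObjectName ss = new ObjectName(
                    "org.apache.cassandra.db:type=StorageService");
                Map<String, String> options = new HashMap<>();
                options.put("incremental", "false");
                Object cmd = mbs.invoke(ss, "repairAsync",
                    new Object[] { "my_keyspace", options },
                    new String[] { "java.lang.String", "java.util.Map" });
                System.out.println("Started repair command #" + cmd);
            }
        }
    }

The point being: everything such an agent needs is already exposed over JMX, so it can live outside the daemon's heap, which speaks to both Rahul's GC worry and Dinesh's metrics-reading suggestion.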
>>>
>>> On Tuesday, April 3, 2018, 10:12:56 PM PDT, Qingcun Zhou <zhouqing...@gmail.com> wrote:
>>>
>>> Repair has been a problem for us at Uber. In general I'm in favor of including the scheduling logic in the Cassandra daemon. It has the benefit of enabling something like load-aware repair, e.g. only scheduling repair while there is no ongoing compaction, or while traffic is low, etc. As proposed by others, we can expose keyspace/table-level configurations so that users can opt in. Regarding the risk: yes, there will be problems at the beginning, but in the long run users will appreciate that repair works out of the box, just like compaction. We have large Cassandra deployments and can work with the Netflix folks on intensive testing to boost user confidence.
>>>
>>> On the other hand, have we looked into how other NoSQL databases do repair? Is there a sidecar process?
>>>
>>> On Tue, Apr 3, 2018 at 9:21 PM, sankalp kohli <kohlisank...@gmail.com> wrote:
>>>
>>>> Repair is critical for running C*, and I agree with Roopa that it needs to be part of the offering. I think we should make it easy for new users to run C*.
>>>>
>>>> Can we have a sidecar process which we can add to the Apache Cassandra offering and put this repair there? I am also fine with putting it in C* if a sidecar is more long term.
>>>>
>>>> On Tue, Apr 3, 2018 at 6:20 PM, Roopa Tangirala <rtangir...@netflix.com.invalid> wrote:
>>>>
>>>>> Having seen so many companies grapple with running repairs successfully in production, and having seen the success of distributed scheduled repair here at Netflix, I strongly believe that adding this to Cassandra would be a great addition to the database. I am hoping we as a community will make it easy for teams to operate and run Cassandra by enhancing the core product, making maintenance tasks like repairs and compactions part of the database without external tooling. We can have an experimental flag for the feature, and only teams who are confident with the service would enable it, while others fall back to default repairs.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Roopa Tangirala
>>>>> Engineering Manager CDE
>>>>> (408) 438-3156 - mobile
>>>>>
>>>>> On Tue, Apr 3, 2018 at 4:19 PM, Kenneth Brotman <kenbrot...@yahoo.com.invalid> wrote:
>>>>>
>>>>>> Why not make it configurable?
>>>>>>
>>>>>> auto_manage_repair_consistency: true (default: false)
>>>>>>
>>>>>> Then users can use the built-in auto repair function that would be created, or continue to handle it as they do now. Default behavior would be "false", so nothing changes on its own. Just wondering why not have that option? It might accelerate progress, as others have already suggested.
>>>>>>
>>>>>> Kenneth Brotman
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Nate McCall [mailto:zznat...@gmail.com]
>>>>>> Sent: Tuesday, April 03, 2018 1:37 PM
>>>>>> To: dev
>>>>>> Subject: Re: Repair scheduling tools
>>>>>>
>>>>>> This document does a really good job of listing out some of the issues of coordinating repair scheduling. Regardless of which camp you fall into, it is certainly worth a read.
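If the scheduler did live inside the daemon behind Kenneth's proposed flag, the gating logic could be as simple as the sketch below. Every name in it (auto_manage_repair_consistency mapped to autoRepairEnabled, the NodeState accessors, the traffic threshold) is hypothetical, chosen only to illustrate the opt-in flag plus the load-aware checks Qingcun describes; none of this exists in Cassandra today.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class AutoRepairScheduler {
        // Stand-ins for daemon state; in a real patch these would come from
        // cassandra.yaml (e.g. auto_manage_repair_consistency: true) and the
        // existing compaction / request metrics.
        interface NodeState {
            boolean autoRepairEnabled();        // the proposed opt-in flag
            int pendingCompactions();           // load signal #1
            double readWriteOpsPerSecond();     // load signal #2
            void repairNextTableSlice();        // one unit of repair work
        }

        private static final double BUSY_OPS_THRESHOLD = 5_000.0; // illustrative

        public static void start(NodeState node) {
            ScheduledExecutorService exec =
                Executors.newSingleThreadScheduledExecutor();
            exec.scheduleWithFixedDelay(() -> {
                if (!node.autoRepairEnabled())
                    return; // default off: nothing changes unless users opt in
                // Load-aware gate: defer while compactions are running or
                // traffic is high, as suggested up-thread.
                if (node.pendingCompactions() > 0
                        || node.readWriteOpsPerSecond() > BUSY_OPS_THRESHOLD)
                    return;
                node.repairNextTableSlice();
            }, 1, 1, TimeUnit.MINUTES);
        }
    }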
>>>>>>
>>>>>> On Wed, Apr 4, 2018 at 8:10 AM, Joseph Lynch <joe.e.ly...@gmail.com> wrote:
>>>>>>> I just want to say I think it would be great for our users if we moved repair scheduling into Cassandra itself. The team here at Netflix has opened the ticket <https://issues.apache.org/jira/browse/CASSANDRA-14346> and has written a detailed design document <https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#heading=h.iasguic42ger> that includes problem discussion and prior art, if anyone wants to contribute to that. We tried to fairly discuss existing solutions, what their drawbacks are, and a proposed solution.
>>>>>>>
>>>>>>> If we were to put this as part of the main Cassandra daemon, I think it should probably be marked experimental and of course be something that users opt into (table by table or cluster by cluster), with the understanding that it might not fully work out of the box the first time we ship it. We have to be willing to take risks, but we also have to be honest with our users. It may help build confidence if a few major deployments use it (such as Netflix), and we are of course happy to provide that QA as best we can.
>>>>>>>
>>>>>>> -Joey
>>>>>>>
>>>>>>> On Tue, Apr 3, 2018 at 10:48 AM, Blake Eggleston <beggles...@apple.com> wrote:
>>>>>>>
>>>>>>>> Hi dev@,
>>>>>>>>
>>>>>>>> The question of the best way to schedule repairs came up on CASSANDRA-14346, and I thought it would be good to bring up the idea of an external tool on the dev list.
>>>>>>>>
>>>>>>>> Cassandra lacks any sort of tooling for automating routine tasks that are required for running clusters, specifically repair. Regular repair is a must for most clusters, like compaction. This means that, especially as far as eventual consistency is concerned, Cassandra isn't totally functional out of the box. Operators either need to find a 3rd-party solution or implement one themselves. Adding this to Cassandra would make it easier to use.
>>>>>>>>
>>>>>>>> Is this something we should be doing? If so, what should it look like?
>>>>>>>>
>>>>>>>> Personally, I feel like this is a pretty big gap in the project and would like to see an out-of-process tool offered. Ideally, Cassandra would just take care of itself, but writing a distributed repair scheduler that you trust to run in production is a lot harder than writing a single-process management application that can fail over.
>>>>>>>>
>>>>>>>> Any thoughts on this?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Blake
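On Blake's "single process management application that can fail over" point, one plausible reading is a lease-based active/standby pair, and Cassandra's own lightweight transactions are enough to build the lease. Below is a sketch using the DataStax Java driver (3.x API); the repair_admin.leader table, the 60-second TTL, and all names are assumptions for illustration, not an existing tool's schema.

    // Assumes a table like:
    //   CREATE TABLE repair_admin.leader (name text PRIMARY KEY, owner text);
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class LeaderLease {
        public static boolean tryAcquireOrRenew(Session session, String me) {
            // INSERT ... IF NOT EXISTS wins the lease; the TTL means a dead
            // scheduler's lease expires so a standby can claim it.
            Row r = session.execute(
                "INSERT INTO repair_admin.leader (name, owner) " +
                "VALUES ('repair_scheduler', ?) IF NOT EXISTS USING TTL 60",
                me).one();
            if (r.getBool("[applied]"))
                return true;
            // Not applied: someone holds the lease. If it's us, renew the TTL.
            if (me.equals(r.getString("owner"))) {
                Row renewed = session.execute(
                    "UPDATE repair_admin.leader USING TTL 60 SET owner = ? " +
                    "WHERE name = 'repair_scheduler' IF owner = ?",
                    me, me).one();
                return renewed.getBool("[applied]");
            }
            return false;
        }

        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                     .addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                String me = java.util.UUID.randomUUID().toString();
                if (tryAcquireOrRenew(session, me))
                    System.out.println("I am the active repair scheduler");
                else
                    System.out.println("Standing by");
            }
        }
    }

A standby running the same loop simply takes over once the active scheduler stops renewing its lease; a production version would of course need to think harder about clock skew and renewal races than this sketch does.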
>>>
>>> --
>>> Thank you & Best Regards,
>>> --Simon (Qingcun) Zhou