I just want to say I think it would be great for our users if we moved
repair scheduling into Cassandra itself. The team here at Netflix has
opened the ticket <https://issues.apache.org/jira/browse/CASSANDRA-14346>
and has written a detailed design document
<https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#heading=h.iasguic42ger>
that includes a problem discussion and prior art, if anyone wants to
contribute to it. We tried to discuss existing solutions and their
drawbacks fairly, and to propose a solution.

If we were to make this part of the main Cassandra daemon, I think it
should probably be marked experimental and of course be something that
users opt into (table by table or cluster by cluster) with the
understanding that it might not fully work out of the box the first time we
ship it. We have to be willing to take risks but we also have to be honest
with our users. It may help build confidence if a few major deployments use
it (such as Netflix) and we are happy of course to provide that QA as best
we can.

-Joey

On Tue, Apr 3, 2018 at 10:48 AM, Blake Eggleston <beggles...@apple.com>
wrote:

> Hi dev@,
>
>
>
> The question of the best way to schedule repairs came up on
> CASSANDRA-14346, and I thought it would be good to bring up the idea of an
> external tool on the dev list.
>
>
>
> Cassandra lacks any sort of tooling for automating the routine tasks that
> are required for running clusters, specifically repair. Like compaction,
> regular repair is a must for most clusters. This means that, especially as
> far as eventual consistency is concerned, Cassandra isn’t totally functional
> out of the box. Operators either need to find a third-party solution or
> implement one themselves. Adding this to Cassandra would make it easier to
> use.
>
>
>
> Is this something we should be doing? If so, what should it look like?
>
>
>
> Personally, I feel like this is a pretty big gap in the project and would
> like to see an out-of-process tool offered. Ideally, Cassandra would just
> take care of itself, but writing a distributed repair scheduler that you
> trust to run in production is a lot harder than writing a single-process
> management application that can fail over.
>
>
>
> Any thoughts on this?
>
>
>
> Thanks,
>
>
>
> Blake
>
>
