> > The scheduler repairs, by default, the primary ranges for all the nodes going through the repair. Since it uses the primary ranges, nodes repairing in parallel would not overlap at all on the primary ranges. However, the replica sets for the nodes going through repair may or may not overlap; that depends on the cluster size and the parallelism used. If a cluster is small, overlap is possible; if it is large, the likelihood drops. Even with a range-centric approach, if we repair N token ranges in parallel there is no guarantee that their replica sets won't overlap on smaller clusters.
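To make the overlap concern above concrete, here is an illustrative sketch (not Cassandra code; the SimpleStrategy-style placement and all helper names are a simplification of my own): with RF=3, three concurrent primary-range repairs on a 6-node cluster must share replicas, while well-spread nodes on a 30-node cluster need not.

```python
# Illustrative sketch: with SimpleStrategy-style placement, the primary
# range owned by node i is replicated on nodes i, i+1, ..., i+RF-1
# around the ring.

def replicas(node: int, cluster_size: int, rf: int = 3) -> set[int]:
    """Replica set for the primary range owned by `node`."""
    return {(node + k) % cluster_size for k in range(rf)}

def overlapping(nodes: list[int], cluster_size: int, rf: int = 3) -> bool:
    """True if any two concurrently repairing nodes share a replica."""
    seen: set[int] = set()
    for n in nodes:
        r = replicas(n, cluster_size, rf)
        if seen & r:
            return True
        seen |= r
    return False

# 3 concurrent sessions on 6 nodes need 9 replica slots, so they must
# collide; on 30 nodes, well-spread sessions can stay disjoint.
print(overlapping([0, 2, 4], cluster_size=6))     # True
print(overlapping([0, 10, 20], cluster_size=30))  # False
```

The arithmetic makes the "smaller clusters" caveat precise: once `parallelism * RF` exceeds the cluster size, overlap is unavoidable no matter how the sessions are placed.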
That's inaccurate: we can check the replica set for the subrange we're about to run and see whether it overlaps with the replica sets of the other ranges already being processed.

> The only solution is to reduce the repair parallelism to one node at a time.

Yes, I agree.

> This is supported with the MVP, we can set "min_repair_interval: 7d" (the default is 24h) and the nodes will repair only once every 7 days.
> The MVP implementation allows running full and incremental repairs (and Preview repair code changes are done and it is coming soon) independently and in parallel. One can set the above config for each repair type with their preferred schedule.

Nice, sorry I missed these in the CEP doc.

> I have already created a ticket to add this as an enhancement
> https://issues.apache.org/jira/browse/CASSANDRA-20013

Thanks, table-level repair priority could be a very interesting improvement; that's something Reaper lacks at the moment as well.

> If a repair session finishes gracefully, then this timeout is not applicable. Anyway, I do not have any strong opinion on the value. I am open to lowering it to *1h* or something.

True, it will only delay killing hanging repairs. One thing that we cannot solve in Reaper at the moment is that sequential and dc-aware repair sessions that get terminated due to the timeout leave ephemeral snapshots behind. Since these are only reclaimed on restart, a lot of timeouts can end up filling the disks if the snapshots get materialized. Since the auto repair runs from within Cassandra, we might have more control here and could implement a proper cleanup of such snapshots.

Alexander Dejanovski
Astra Managed Clusters / Mission Control
w. www.datastax.com <https://www.datastax.com/lp/astra-registration>

On Mon, Oct 28, 2024 at 7:01 PM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:

> Thanks a lot, Alexander, for the review!
> Please find my response below:
>
> > making these replicas process 3 concurrent repairs while others could be left uninvolved in any repair at all... Taking a range centric approach (we're not repairing nodes, we're repairing the token ranges) allows spreading the load evenly without overlap in the replica sets.
>
> The scheduler repairs, by default, the primary ranges for all the nodes going through the repair. Since it uses the primary ranges, nodes repairing in parallel would not overlap at all on the primary ranges. However, the replica sets for the nodes going through repair may or may not overlap; that depends on the cluster size and the parallelism used. If a cluster is small, overlap is possible; if it is large, the likelihood drops. Even with a range-centric approach, if we repair N token ranges in parallel there is no guarantee that their replica sets won't overlap on smaller clusters.
>
> > I'm more worried even with incremental repair here, because you might end up with some conflicts around sstables which would be in the pending repair pool but would be needed by a competing repair job.
>
> This can happen regardless of whether we go "node-centric" vs. "range-centric" if we run multiple parallel repair sessions. The reason is that the SSTables of the nodes going through repair may not be physically isolated 1:1 with the token ranges being repaired. We just had a detailed discussion about SSTable overlap for incremental repair (IR) last week in Slack (#cassandra-repair-scheduling-cep37), and the general consensus was that there is no better way to address it than to retry a few times. The only solution is to reduce the repair parallelism to one node at a time.
>
> The ideal and reliable way to run IR is to calculate the token ranges based on the unrepaired data size and also apply an upper cap on the data size being repaired.
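The size-capped splitting Jaydeep describes can be sketched roughly as follows (illustrative only; the uniform-distribution assumption and the helper names are mine, not from the CEP):

```python
# Illustrative sketch of size-based token splitting: split a range into
# subranges whose estimated unrepaired data is each under a cap,
# assuming data is uniformly distributed across the token range.
import math

def split_by_size(start: int, end: int, unrepaired_bytes: int,
                  cap_bytes: int) -> list[tuple[int, int]]:
    """Split (start, end] into subranges each estimated <= cap_bytes."""
    n = max(1, math.ceil(unrepaired_bytes / cap_bytes))
    width = (end - start) / n
    bounds = [start + round(i * width) for i in range(n)] + [end]
    return list(zip(bounds, bounds[1:]))

# 10 GiB of unrepaired data with a 2 GiB cap -> 5 subranges.
splits = split_by_size(0, 1000, 10 * 2**30, 2 * 2**30)
print(len(splits))  # 5
```

A real implementation would estimate the unrepaired size per range from sstable metadata rather than assume uniformity; the cap is what bounds the work (and the timeout exposure) of any single session.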
> The good news is that Andy Tolbert has already extended the CEP-37 MVP for this, and he is working on polishing it by adding the necessary tests, etc., so it can land on top of this MVP. tl;dr: Andy T and Chris L are already on top of this, and it will soon be available on top of the CEP-37 MVP.
>
> > I don't know if in the latest versions such sstables would be totally ignored or if the competing repair job would fail.
>
> The competing IR session would be aborted, and the scheduler would retry a few times.
>
> > Continuous repair might create a lot of overhead for full repairs which often don't require more than 1 run per week.
>
> This is supported with the MVP: we can set "min_repair_interval: 7d" (the default is 24h) and the nodes will repair only once every 7 days.
>
> > It also will not allow running a mix of scheduled full/incremental repairs
>
> The MVP implementation allows running full and incremental repairs (and Preview repair is coming soon; the code changes are done) independently and in parallel. One can set the above config for each repair type with the preferred schedule.
>
> > Here, nodes will be processed sequentially and each node will process the keyspaces sequentially, tying the repair cycle of all keyspaces together.
>
> The keyspaces and tables on each node will be randomly shuffled to avoid multiple nodes working on the same table/keyspace.
>
> > There are many cases where one might have differentiated gc_grace_seconds settings to optimize reclaiming tombstones when applicable. That requires having some fine control over the repair cycle for a given keyspace/set of tables.
>
> As I mentioned, there is already a way to schedule the frequency of the repair cycle, but the frequency is currently a global config on a node and hence applies to all the tables on that node.
> However, the MVP design is flexible enough to be easily extended to add the schedule as a new CQL table-level property, which would then honor the table-level schedule as opposed to a global one. There was another suggestion from @masokol (from Ecchronos) to maybe assign a repair priority at the table level to prioritize one table over another; that could also solve this problem and is also feasible on top of the MVP. I have already created a ticket to add this as an enhancement:
> https://issues.apache.org/jira/browse/CASSANDRA-20013
>
> > I think the 3 hours timeout might be quite large and probably means a lot of data is being repaired for each split. That usually involves some level of overstreaming
>
> This timeout exists to unstick repair sessions that hang due to some bug in the repair code path, e.g.
> https://issues.apache.org/jira/browse/CASSANDRA-14674
> If a repair session finishes gracefully, then this timeout is not applicable. Anyway, I do not have any strong opinion on the value. I am open to lowering it to *1h* or something.
>
> Jaydeep
>
> On Mon, Oct 28, 2024 at 4:45 AM Alexander DEJANOVSKI <adejanov...@gmail.com> wrote:
>
>> Hi Jaydeep,
>>
>> I've taken a look at the proposed design and have a few comments/questions.
>> As one of the maintainers of Reaper, I'm looking at this through the lens of how Reaper does things.
>>
>> *The approach taken in the CEP-37 design is "node-centric" vs. the "range-centric" approach (which is the one Reaper takes).*
>> I'm worried that this will not allow spreading the repair load evenly across the cluster, since nodes are the concurrency unit. You could allow running repair on 3 nodes concurrently, for example, but these 3 nodes could all involve the same replicas, making these replicas process 3 concurrent repairs while others could be left uninvolved in any repair at all.
>> Taking a range-centric approach (we're not repairing nodes, we're repairing the token ranges) allows spreading the load evenly without overlap in the replica sets.
>> I'm even more worried with incremental repair here, because you might end up with some conflicts around sstables which would be in the pending repair pool but would be needed by a competing repair job.
>> I don't know if in the latest versions such sstables would be totally ignored or if the competing repair job would fail.
>>
>> *Each repair command will repair all keyspaces (with the ability to fully exclude some tables), and I haven't seen a notion of schedule, which seems to suggest repairs are running continuously (unless I missed something?).*
>> There are many cases where one might have differentiated gc_grace_seconds settings to optimize reclaiming tombstones when applicable. That requires having some fine control over the repair cycle for a given keyspace/set of tables.
>> Here, nodes will be processed sequentially and each node will process the keyspaces sequentially, tying the repair cycle of all keyspaces together.
>> If one of the ranges for a specific keyspace cannot be repaired within the 3-hour timeout, it could block the repairs of all the other keyspaces.
>> Continuous repair might create a lot of overhead for full repairs, which often don't require more than one run per week.
>> It also will not allow running a mix of scheduled full/incremental repairs (I'm unsure if that is still a recommendation, but it was still recommended not so long ago).
>>
>> *The timeout base duration is large*
>> I think the 3-hour timeout might be quite large and probably means a lot of data is being repaired for each split. That usually involves some level of overstreaming. I don't have numbers to support this; it's more about my own experience sizing splits in production with Reaper to reduce the impact on cluster performance as much as possible.
>> We use 30 minutes as the default in Reaper, with subsequent attempts growing the timeout dynamically for challenging splits.
>>
>> Finally, thanks for picking this up. I'm eager to see Reaper no longer being needed and the database managing its own repairs!
>>
>> On Tue, 22 Oct 2024 at 21:10, Benedict <bened...@apache.org> wrote:
>>
>>> I realise it's out of scope, but to counterbalance all of the pro-decomposition messages I wanted to chime in with a strong -1. But we can debate that in a suitable context later.
>>>
>>> On 22 Oct 2024, at 16:36, Jordan West <jw...@apache.org> wrote:
>>>
>>> Agreed with the sentiment that decomposition is a good target but out of scope here. I'm personally excited to see an in-tree repair scheduler and am supportive of the approach shared here.
>>>
>>> Jordan
>>>
>>> On Tue, Oct 22, 2024 at 08:12 Dinesh Joshi <djo...@apache.org> wrote:
>>>
>>>> Decomposing Cassandra may be architecturally desirable, but that is not the goal of this CEP. This CEP brings value to operators today, so it should be considered on that merit. We definitely need to have a separate conversation about Cassandra's architectural direction.
>>>>
>>>> On Tue, Oct 22, 2024 at 7:51 AM Joseph Lynch <joe.e.ly...@gmail.com> wrote:
>>>>
>>>>> I definitely like this in C* itself. We only changed our proposal to put repair scheduling in the sidecar because trunk was frozen for the foreseeable future at that time. With trunk unfrozen and development on the main process going at a fast pace, I think it makes much more sense to integrate natively as table properties, as this CEP proposes. Completely agree the scheduling overhead should be minimal.
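The dynamically growing timeout Alexander describes for Reaper could look something like the sketch below (the doubling policy and the 3-hour cap here are assumptions for illustration, not Reaper's actual algorithm):

```python
# Sketch of an attempt-scaled segment timeout: start small so healthy
# splits fail fast, and grow the allowance on each retry for splits
# that genuinely need more time.
from datetime import timedelta

BASE_TIMEOUT = timedelta(minutes=30)

def segment_timeout(attempt: int,
                    cap: timedelta = timedelta(hours=3)) -> timedelta:
    """Double the base timeout on each retry, up to a cap."""
    return min(BASE_TIMEOUT * (2 ** attempt), cap)

print(segment_timeout(0))  # 0:30:00
print(segment_timeout(2))  # 2:00:00
print(segment_timeout(5))  # 3:00:00 (capped)
```

This keeps the common case responsive (a hung session is killed after 30 minutes, not 3 hours) while still letting a legitimately slow split eventually complete.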
>>>>> Moving the actual repair operation (comparing data and streaming mismatches) along with compaction operations to a separate process makes a lot of sense long term, but imo only once we have both a release of the sidecar and a contract figured out between them for communication. I'm watching CEP-38 there, as I think CQL and virtual tables are looking much stronger than when we wrote CEP-1 and chose HTTP, but that's for that discussion and not this one.
>>>>>
>>>>> -Joey
>>>>>
>>>>> On Mon, Oct 21, 2024 at 3:25 PM Francisco Guerrero <fran...@apache.org> wrote:
>>>>>
>>>>>> Like others have said, I was expecting the scheduling portion of repair to be negligible. I was mostly curious if you had something handy that you could quickly share.
>>>>>>
>>>>>> On 2024/10/21 18:59:41 Jaydeep Chovatia wrote:
>>>>>> > > Jaydeep, do you have any metrics on your clusters comparing them before and after introducing repair scheduling into the Cassandra process?
>>>>>> >
>>>>>> > Yes, I made some comparisons when I started rolling this feature out to our production five years ago :) Here are the details:
>>>>>> >
>>>>>> > *The Scheduling*
>>>>>> > The scheduling itself is exceptionally lightweight, as only one additional thread monitors the repair activity, updating the status in a system table once every few minutes or so. So it does not show up anywhere in the CPU charts, etc. Unfortunately, I do not have those graphs now, but I can do a quick comparison if it helps!
>>>>>> >
>>>>>> > *The Repair Itself*
>>>>>> > As we all know, the Cassandra repair algorithm is a heavyweight process due to Merkle trees/streaming, etc., no matter how we schedule it. But that is an orthogonal topic, and folks are already discussing creating a new CEP.
>>>>>> >
>>>>>> > Jaydeep
>>>>>> >
>>>>>> > On Mon, Oct 21, 2024 at 10:02 AM Francisco Guerrero <fran...@apache.org> wrote:
>>>>>> >
>>>>>> > > Jaydeep, do you have any metrics on your clusters comparing them before and after introducing repair scheduling into the Cassandra process?
>>>>>> > >
>>>>>> > > On 2024/10/21 16:57:57 "J. D. Jordan" wrote:
>>>>>> > > > Sounds good. Just wanted to bring it up. I agree that the scheduling bit is pretty lightweight, and the ideal would be to move the whole of the repair external, which is a much bigger can of worms to open.
>>>>>> > > >
>>>>>> > > > -Jeremiah
>>>>>> > > >
>>>>>> > > > > On Oct 21, 2024, at 11:21 AM, Chris Lohfink <clohfin...@gmail.com> wrote:
>>>>>> > > > >
>>>>>> > > > > > I actually think we should be looking at how we can move things out of the database process.
>>>>>> > > > >
>>>>>> > > > > While worth pursuing, I think we would need a different CEP just to figure out how to do that. Not only is there a lot of infrastructure difficulty in running multi-process, the inter-app communication needs to be figured out better than JMX. Even with the sidecar, we don't yet have a solid story on how to ensure both are running; it's up to each app owner to figure it out. Once we have a good thing in place, I think we can start moving compactions, repairs, etc. out of the database. Even then, it's the _repairs_ that are expensive, not the scheduling.
>>>>>> > > > >
>>>>>> > > > > On Mon, Oct 21, 2024 at 9:45 AM Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:
>>>>>> > > > >
>>>>>> > > > >> I love the idea of a repair service being there by default for an install of C*. My main concern here is that it puts more services into the main database process. I actually think we should be looking at how we can move things out of the database process. The C* process being a giant monolith has always been a pain point. Is there any way it makes sense for this to be an external process rather than a new thread pool inside the C* process?
>>>>>> > > > >>
>>>>>> > > > >> -Jeremiah Jordan
>>>>>> > > > >>
>>>>>> > > > >> On Oct 18, 2024 at 2:58:15 PM, Mick Semb Wever <m...@apache.org> wrote:
>>>>>> > > > >>
>>>>>> > > > >>> This is looking strong, thanks Jaydeep.
>>>>>> > > > >>>
>>>>>> > > > >>> I would suggest folk take a look at the design doc and the PR in the CEP. A lot is there (that I have completely missed).
>>>>>> > > > >>>
>>>>>> > > > >>> I would especially ask all authors of prior art (Reaper, DSE nodesync, ecchronos) to take a final review of the proposal.
>>>>>> > > > >>>
>>>>>> > > > >>> Jaydeep, can we ask for a two-week window while we reach out to these people? There's a lot of prior art in this space, and it feels like we're in a good place now where it's clear this has legs, and we can use that to bring folk in and make sure there are no remaining blindspots.
>>>>>> > > > >>>
>>>>>> > > > >>> On Fri, 18 Oct 2024 at 01:40, Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>>>> > > > >>>
>>>>>> > > > >>>> Sorry, there is a typo in the CEP-37 link; here is the correct link:
>>>>>> > > > >>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution
>>>>>> > > > >>>>
>>>>>> > > > >>>> On Thu, Oct 17, 2024 at 4:36 PM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>>>> > > > >>>>
>>>>>> > > > >>>>> First, thank you for your patience while we strengthened CEP-37.
>>>>>> > > > >>>>>
>>>>>> > > > >>>>> Over the last eight months, Chris Lohfink, Andy Tolbert, Josh McKenzie, Dinesh Joshi, Kristijonas Zalys, and I have done tons of work (online discussions and a dedicated Slack channel, #cassandra-repair-scheduling-cep37) to come up with the best possible design, one that not only significantly simplifies repair operations but also includes the most common features that everyone will benefit from when running at scale.
>>>>>> > > > >>>>>
>>>>>> > > > >>>>> For example:
>>>>>> > > > >>>>>
>>>>>> > > > >>>>> * Apache Cassandra must be capable of running multiple repair types, such as Full, Incremental, Paxos, and Preview, so the framework should be easily extendable with no additional overhead from the operator's point of view.
>>>>>> > > > >>>>> * An easy way to extend the token-split calculation algorithm, with a default implementation, should exist.
>>>>>> > > > >>>>> * Running incremental repair reliably at scale is pretty challenging, so we need safeguards in place, such as migration/rollback without a restart, and stopping incremental repair automatically if the disk is about to get full.
>>>>>> > > > >>>>>
>>>>>> > > > >>>>> We are glad to inform you that CEP-37 (i.e., repair inside Cassandra) is now officially ready for review after multiple rounds of design, testing, code reviews, documentation reviews, and, more importantly, validation that it runs at scale!
>>>>>> > > > >>>>>
>>>>>> > > > >>>>> Some facts about CEP-37:
>>>>>> > > > >>>>>
>>>>>> > > > >>>>> * Multiple members have verified all aspects of CEP-37 numerous times.
>>>>>> > > > >>>>> * The design proposed in CEP-37 has been thoroughly tried and tested at immense scale (hundreds of unique Cassandra clusters, tens of thousands of Cassandra nodes, with tens of millions of QPS) on top of 4.1 open source for more than five years; please see more details here:
>>>>>> > > > >>>>> https://www.uber.com/en-US/blog/how-uber-optimized-cassandra-operations-at-scale/
>>>>>> > > > >>>>> * The following presentation, given during last week's Apache Cassandra Bay Area meetup (https://www.meetup.com/apache-cassandra-bay-area/events/303469006/), highlights the rigor applied to CEP-37:
>>>>>> > > > >>>>> https://docs.google.com/presentation/d/1Zilww9c7LihHULk_ckErI2s4XbObxjWknKqRtbvHyZc/edit#slide=id.g30a4fd4fcf7_0_13
>>>>>> > > > >>>>>
>>>>>> > > > >>>>> Since things have been massively overhauled, we believe it is almost ready for a final pass pre-VOTE. We would like you to please review CEP-37 (https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution) and the associated detailed design doc (https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0).
>>>>>> > > > >>>>>
>>>>>> > > > >>>>> Thank you everyone!
>>>>>> > > > >>>>>
>>>>>> > > > >>>>> Chris, Andy, Josh, Dinesh, Kristijonas, and Jaydeep
>>>>>> > > > >>>>>
>>>>>> > > > >>>>> On Thu, Sep 19, 2024 at 11:26 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>> > > > >>>>>
>>>>>> > > > >>>>>> Not quite; finishing touches on the CEP and design doc are in flight (as of last/this week).
>>>>>> > > > >>>>>>
>>>>>> > > > >>>>>> Soon(tm).
>>>>>> > > > >>>>>>
>>>>>> > > > >>>>>> On Thu, Sep 19, 2024, at 2:07 PM, Patrick McFadin wrote:
>>>>>> > > > >>>>>>
>>>>>> > > > >>>>>>> Is this CEP ready for a VOTE thread?
>>>>>> > > > >>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Unified+Repair+Solution
>>>>>> > > > >>>>>>>
>>>>>> > > > >>>>>>> On Sun, Feb 25, 2024 at 12:25 PM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>>>> > > > >>>>>>>
>>>>>> > > > >>>>>>>> Thanks, Josh. I've just updated the CEP (https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+%28DRAFT%29+Apache+Cassandra+Official+Repair+Solution) and included all the solutions you mentioned below.
>>>>>> > > > >>>>>>>>
>>>>>> > > > >>>>>>>> Jaydeep
>>>>>> > > > >>>>>>>>
>>>>>> > > > >>>>>>>> On Thu, Feb 22, 2024 at 9:33 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>> > > > >>>>>>>>
>>>>>> > > > >>>>>>>>> Very late response from me here (basically necro'ing this thread).
>>>>>> > > > >>>>>>>>>
>>>>>> > > > >>>>>>>>> I think it'd be useful to get this condensed into a CEP that we can then discuss in that format. It's clearly something we all agree we need, and having an implementation that works, even if it's not in your preferred execution domain, is vastly better than nothing IMO.
>>>>>> > > > >>>>>>>>>
>>>>>> > > > >>>>>>>>> I don't have cycles (nor background ;) ) to do that, but it sounds like you do, Jaydeep, given the implementation you have on a private fork + design.
>>>>>> > > > >>>>>>>>>
>>>>>> > > > >>>>>>>>> A non-exhaustive list of things that might be useful to incorporate into or reference from a CEP:
>>>>>> > > > >>>>>>>>>
>>>>>> > > > >>>>>>>>> Slack thread: https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
>>>>>> > > > >>>>>>>>> Joey's old C* ticket: https://issues.apache.org/jira/browse/CASSANDRA-14346
>>>>>> > > > >>>>>>>>> Even older automatic repair scheduling: https://issues.apache.org/jira/browse/CASSANDRA-10070
>>>>>> > > > >>>>>>>>> Your design gdoc: https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0
>>>>>> > > > >>>>>>>>> PR with automated repair: https://github.com/jaydeepkumar1984/cassandra/commit/ef6456d652c0d07cf29d88dfea03b73704814c2c
>>>>>> > > > >>>>>>>>>
>>>>>> > > > >>>>>>>>> My intuition is that we're all basically in agreement that this is something the DB needs, we're all willing to bikeshed for our personal preference on where it lives and how it's implemented, and at the end of the day, code talks. I don't think anyone's said they'll die on the hill of implementation details, so that feels like CEP time to me.
>>>>>> > > > >>>>>>>>>
>>>>>> > > > >>>>>>>>> If you were willing and able to get a CEP together for automated repair based on the above material, given you've done the work and have the proof points that it's working at scale, I think this would be a _huge contribution_ to the community.
>>>>>> > > > >>>>>>>>>
>>>>>> > > > >>>>>>>>> On Thu, Aug 24, 2023, at 7:26 PM, Jaydeep Chovatia wrote:
>>>>>> > > > >>>>>>>>>
>>>>>> > > > >>>>>>>>>> Is anyone going to file an official CEP for this?
>>>>>> > > > >>>>>>>>>>
>>>>>> > > > >>>>>>>>>> As mentioned in this email thread, here is one of the solutions' design doc (https://docs.google.com/document/d/1CJWxjEi-mBABPMZ3VWJ9w5KavWfJETAGxfUpsViPcPo/edit#heading=h.r112r46toau0) and source code on a private Apache Cassandra patch. Could you go through it and let me know what you think?
>>>>>> > > > >>>>>>>>>>
>>>>>> > > > >>>>>>>>>> Jaydeep
>>>>>> > > > >>>>>>>>>>
>>>>>> > > > >>>>>>>>>> On Wed, Aug 2, 2023 at 3:54 PM Jon Haddad <rustyrazorbl...@apache.org> wrote:
>>>>>> > > > >>>>>>>>>>
>>>>>> > > > >>>>>>>>>>> > That said I would happily support an effort to bring repair scheduling to the sidecar immediately.
>>>>>> > > > >>>>>>>>>>> > This has nothing blocking it, and would potentially enable the sidecar to provide an official repair scheduling solution that is compatible with current or even previous versions of the database.
>>>>>> > > > >>>>>>>>>>>
>>>>>> > > > >>>>>>>>>>> This is something I hadn't thought much about, and it is a pretty good argument for using the sidecar initially. There are a lot of deployments out there, and having an official repair option would be a big win.
>>>>>> > > > >>>>>>>>>>>
>>>>>> > > > >>>>>>>>>>> On 2023/07/26 23:20:07 "C. Scott Andreas" wrote:
>>>>>> > > > >>>>>>>>>>> > I agree that it would be ideal for Cassandra to have a repair scheduler in-DB.
>>>>>> > > > >>>>>>>>>>> >
>>>>>> > > > >>>>>>>>>>> > That said, I would happily support an effort to bring repair scheduling to the sidecar immediately. This has nothing blocking it, and would potentially enable the sidecar to provide an official repair scheduling solution that is compatible with current or even previous versions of the database.
>>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > Once TCM has landed, we’ll have much stronger >>>>>> primitives >>>>>> > > for >>>>>> > > > repair orchestration in the database itself. But I don’t think >>>>>> that >>>>>> > > should >>>>>> > > > block progress on a repair scheduling solution in the sidecar, >>>>>> and there >>>>>> > > is >>>>>> > > > nothing that would prevent someone from continuing to use a >>>>>> sidecar-based >>>>>> > > > solution in perpetuity if they preferred. >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > \- Scott >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > > On Jul 26, 2023, at 3:25 PM, Jon Haddad >>>>>> > > > <[rustyrazorbl...@apache.org](mailto:rustyrazorbl...@apache.org >>>>>> )> >>>>>> > > wrote: >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > > >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > > I'm 100% in favor of repair being part of the >>>>>> core DB, >>>>>> > > not >>>>>> > > > the sidecar. The current (and past) state of things where >>>>>> running the DB >>>>>> > > > correctly *requires* running a separate process (either >>>>>> community >>>>>> > > maintained >>>>>> > > > or official C* sidecar) is incredibly painful for folks. The >>>>>> idea that >>>>>> > > your >>>>>> > > > data integrity needs to be opt-in has never made sense to me >>>>>> from the >>>>>> > > > perspective of either the product or the end user. 
>>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > > >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > > I've worked with way too many teams that have >>>>>> either >>>>>> > > > configured this incorrectly or not at all. >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > > >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > > Ideally Cassandra would ship with repair built >>>>>> in and on >>>>>> > > by >>>>>> > > > default. Power users can disable if they want to continue to >>>>>> maintain >>>>>> > > their >>>>>> > > > own repair tooling for some reason. >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > > >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > > Jon >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > > >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > >> On 2023/07/24 20:44:14 German Eichberger via >>>>>> dev >>>>>> > > wrote: >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > >> All, >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > >> We had a brief discussion in [2] about the >>>>>> Uber article >>>>>> > > [1] >>>>>> > > > where they talk about having integrated repair into Cassandra >>>>>> and how >>>>>> > > great >>>>>> > > > that is. I expressed my disappointment that they didn't work >>>>>> with the >>>>>> > > > community on that (Uber, if you are listening time to make >>>>>> amends 🙂) >>>>>> > > and it >>>>>> > > > turns out Joey already had the idea and wrote the code [3] - so >>>>>> I wanted >>>>>> > > to >>>>>> > > > start a discussion to gauge interest and maybe how to revive >>>>>> that >>>>>> > > effort. 
>>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > >> Thanks, >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > >> German >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > >> [1] < >>>>>> > > https://www.uber.com/blog/how-uber-optimized-cassandra- >>>>>> > > > operations-at-scale/> >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > >> [2] <https://the- >>>>>> > > > asf.slack.com/archives/CK23JSY2K/p1690225062383619> >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > >> [3] < >>>>>> > > https://issues.apache.org/jira/browse/CASSANDRA-14346> >>>>>> > > > > >>>>>> > > > >>>>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>>>> > >>>>>> > > > > >>>>>> > > > >>>>>>>>> >>>>>> > > > >>>>>> > > > >>>>>>>>> >>>>>> > > > > >>>>>> > > > >>>>>> >>>>>> > > > >>>>>> > > > >>>>>> >>>>>> > > > > >>>>>> > > > >>>>>> > > > >>>>>> > > >>>>>> > >>>>>> >>>>>
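For readers following along, the "built in and on by default, opt-out for power users" idea discussed earlier in the thread could take a cassandra.yaml shape roughly like the sketch below. This is purely illustrative: only `min_repair_interval` (default 24h, settable per repair type, e.g. 7d) is confirmed in this thread; every other option name here is an assumption, not the actual CEP configuration.

```yaml
# Hypothetical sketch only -- option names other than min_repair_interval
# are illustrative assumptions, not the real CEP/MVP schema.
auto_repair:
  enabled: true                  # on by default; power users can opt out
  repair_types:
    full:
      min_repair_interval: 7d    # each node repairs at most once per week
    incremental:
      min_repair_interval: 24h   # the default interval mentioned in the thread
```

The key point the sketch illustrates is that full and incremental repairs are scheduled independently, each with its own interval.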