> Repairs managed by Reaper and Scylla Manager do not guarantee a
> deterministic ordering or timing of the individual nodetool repair
> operations they manage between separate cycles, breaking the "you are
> performing the cycles in the same order around the nodes every time"
> assumption. That's the context from which my original cluster-level
> repair example comes.
To simplify operations, the in-built AutoRepair feature newly introduced in Cassandra as part of CEP-37 <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37%3A+Apache+Cassandra+Unified+Repair+Solution> includes intelligent behavior that tracks the oldest-repaired node in the cluster and prioritizes it for repair. It also emits a range of metrics to assist operators. One key metric, LongestUnrepairedSec <https://github.com/apache/cassandra/blob/trunk/doc/modules/cassandra/pages/managing/operating/metrics.adoc#automated-repair-metrics>, indicates how long it has been since any part of the data was last repaired. Operators can create an alarm on the metric if it becomes higher than *gc_grace_seconds*.

Jaydeep

On Sat, May 17, 2025 at 12:54 PM Mike Sun <m...@msun.io> wrote:

> Jeremiah, you’re right, I’ve been using “repair” to mean a cluster-level
> repair as opposed to a single “nodetool repair” operation, and the
> Cassandra docs mean “nodetool repair” when referring to a repair. Thanks
> for pointing that out! I agree that the recommendation to run a “nodetool
> repair” on every node or token range every 7 days with gc_grace_seconds =
> 10 days should practically prevent data resurrection.
>
> I still think, though, that theoretically, starting and completing each
> nodetool repair operation within gc_grace_seconds won't absolutely
> guarantee that no tombstone can expire unrepaired. nodetool repair
> operations on the same node + token range(s) don't always take the same
> amount of time to run and therefore don’t guarantee that specific tokens
> are always repaired at the same elapsed time.
>
> e.g.
> if gc_grace_seconds=10 hours, nodetool repair is run every 7 hours, and
> nodetool repair operations can take between 2 and 5 hours:
>
> - 00:00 - nodetool repair 1 starts on node A
> - 00:30 - nodetool repair 1 repairs token T
> - 01:00 - token T is deleted
> - 02:00 - nodetool repair 1 completes
> - 07:00 - nodetool repair 2 starts on node A
> - 11:00 - tombstone for token T expires
> - 11:30 - nodetool repair 2 repairs token T
> - 12:00 - nodetool repair 2 completes
>
> In reality, I agree this is very unlikely to happen. But if we’re looking
> to establish a rigorous requirement that prevents any chance of data
> resurrection, then I believe it’s the invariant I proposed for
> “cluster-level repairs”—that two consecutive complete repairs must
> succeed within gc_grace_seconds. The theoretical risk of data
> resurrection is something that keeps me up at night! :)
>
> More practically, in my experience with Cassandra and Scylla clusters,
> most operators reason about repairs as “cluster-level” rather than as
> individual “nodetool repair” operations, especially due to the use of
> Reaper for Cassandra and Scylla Manager. Reaper and Scylla Manager repair
> jobs are cluster-level, and repair administration and monitoring are
> generally at the cluster level, e.g. cluster-level repair schedules,
> durations, and successes/completions.
>
> Repairs managed by Reaper and Scylla Manager do not guarantee a
> deterministic ordering or timing of the individual nodetool repair
> operations they manage between separate cycles, breaking the "you are
> performing the cycles in the same order around the nodes every time”
> assumption. That’s the context from which my original cluster-level
> repair example comes.
>
> Thanks for the helpful discussion, I will update my blog post to reflect
> the helpful clarifications!
>
> On Fri, May 16, 2025 at 5:25 PM Jeremiah Jordan <jerem...@apache.org>
> wrote:
>
>> I agree we need to do a better job at wording this so people can
>> understand what is happening.
>>
>> For your exact example here, you are actually looking at too broad a
>> thing. The exact requirements are not at the full cluster level, but
>> actually at the “token range” level at which repair operates: a given
>> token range needs to have repair start and complete within the gc_grace
>> sliding window. For your example of a repair cycle that takes 5 days and
>> is started every 7 days, assuming you are performing the cycles in the
>> same order around the nodes every time, a given node will have been
>> repaired within 7 days, even though the start of repair 1 to the finish
>> of repair 2 was more than 7 days. The start of “token ranges repaired on
>> day 0” to the finish of “token ranges repaired on day 7” is less than
>> the gc_grace window.
>>
>> -Jeremiah Jordan
>>
>> On May 16, 2025 at 2:03:00 PM, Mike Sun <m...@msun.io> wrote:
>>
>>> The wording is subtle and can be confusing...
>>>
>>> It's important to distinguish between:
>>> 1. "You need to start and complete a repair within any gc_grace_seconds
>>> window"
>>> 2. "You need to start and complete a repair within gc_grace_seconds"
>>>
>>> #1 is a sliding time window: any time interval between when the
>>> tombstone is written (tombstone_created_time) and when it expires
>>> (tombstone_created_time + gc_grace_seconds)
>>>
>>> #2 is a duration bound for the repair time
>>>
>>> My post is saying that to ensure the #1 requirement, you actually need
>>> to "start and complete two consecutive repairs within gc_grace_seconds"
>>>
>>> On Fri, May 16, 2025 at 2:49 PM Mike Sun <m...@msun.io> wrote:
>>>
>>>> > You need to *start and complete* a repair within any
>>>> gc_grace_seconds window.
>>>>
>>>> Exactly this. And since "any gc_grace_seconds window" does not mean
>>>> "any gc_grace window from which a repair starts"... the requirement
>>>> needs to be that the duration to "start and complete" two consecutive
>>>> full repairs is within gc_grace_seconds...
>>>> That will ensure a repair "starts and completes" within "any
>>>> gc_grace_seconds" window.
>>>>
>>>> On Fri, May 16, 2025 at 2:43 PM Mick Semb Wever <m...@apache.org>
>>>> wrote:
>>>>
>>>>> .
>>>>>
>>>>>> e.g., assume gc_grace_seconds=10 days, and a repair takes 5 days to
>>>>>> run:
>>>>>> * Day 0: Repair 1 starts and processes token A
>>>>>> * Day 1: Token A is deleted, resulting in Tombstone A that will
>>>>>> expire on Day 11
>>>>>> * Day 5: Repair 1 completes
>>>>>> * Day 7: Repair 2 starts
>>>>>> * Day 11: Tombstone A expires without being repaired
>>>>>> * Day 12: Repair 2 repairs Token A and completes
>>>>>
>>>>> You need to *start and complete* a repair within any gc_grace_seconds
>>>>> window.
>>>>> In your example no repair started and completed in the Day 1-11
>>>>> window.
>>>>>
>>>>> We do need to word this better, thanks for pointing it out Mike.
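The hazard debated in this thread can be stated mechanically: a deleted value can be resurrected only if no repair touches its token in the window between the deletion and the tombstone's expiry. The following is a minimal illustrative sketch (plain Python, not Cassandra code; the function name and the hour values are taken from Mike's hours-based example, not from any real API):

```python
# Illustrative sketch of the timing hazard: gc_grace_seconds = 10 hours,
# and each repair cycle may touch a given token at a different elapsed
# offset, so per-cycle start times alone do not cover every window.

GC_GRACE_HOURS = 10

def resurrectable(delete_h, touches_h, gc_grace_h=GC_GRACE_HOURS):
    """True if no repair touches the token between the deletion and the
    tombstone's expiry, i.e. the tombstone can expire unrepaired."""
    expiry_h = delete_h + gc_grace_h
    return not any(delete_h < t <= expiry_h for t in touches_h)

# Timeline from the example: token T deleted at 01:00; repair 1 touched T
# at 00:30 (before the delete), repair 2 only touches T at 11:30, after
# the tombstone expired at 11:00.
print(resurrectable(1.0, [0.5, 11.5]))   # True: resurrection is possible

# Had repair 2 touched T at, say, 08:00, the window would be covered.
print(resurrectable(1.0, [0.5, 8.0]))    # False
```

This is why the thread converges on reasoning per token range: the guarantee has to hold for every token's deletion window, not merely for the start-to-start cadence of whole repair cycles.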