>> To simplify operations, the newly introduced in-built AutoRepair feature in Cassandra (as part of CEP-37 <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37%3A+Apache+Cassandra+Unified+Repair+Solution>) includes intelligent behavior that tracks the oldest repaired node in the cluster and prioritizes it for repair. It also emits a range of metrics to assist operators. One key metric, LongestUnrepairedSec <https://github.com/apache/cassandra/blob/trunk/doc/modules/cassandra/pages/managing/operating/metrics.adoc#automated-repair-metrics>, indicates how long it has been since the last repair for any part of the data. Operators can create an alarm on the metric if it becomes higher than *gc_grace_seconds*.

This is great to hear! Thanks for pointing me to that, Jaydeep. It will definitely make it easier for operators to monitor and alarm on the risk of expiring tombstones. I will update my post to include this upcoming feature.
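As a rough sketch of the kind of alarm described above (the metric name comes from the linked docs, but the scraping mechanism, the 10-day gc_grace_seconds, and the 80% warning threshold are purely illustrative assumptions):

    # Sketch of an alarm check on the LongestUnrepairedSec metric.
    # The metric value would come from wherever you scrape Cassandra metrics
    # (JMX, a Prometheus exporter, etc.); here it is just passed in as a number.

    GC_GRACE_SECONDS = 10 * 24 * 3600   # this table's gc_grace_seconds (10 days here)
    WARN_RATIO = 0.8                    # warn before the threshold is actually crossed

    def classify_repair_lag(longest_unrepaired_sec):
        """Map the metric value to an alert level relative to gc_grace_seconds."""
        if longest_unrepaired_sec >= GC_GRACE_SECONDS:
            return "CRITICAL: data has gone unrepaired for longer than gc_grace_seconds"
        if longest_unrepaired_sec >= WARN_RATIO * GC_GRACE_SECONDS:
            return "WARNING: repair lag is approaching gc_grace_seconds"
        return "OK"

    # Example: nine days of lag against a ten-day gc_grace_seconds -> WARNING
    print(classify_repair_lag(9 * 24 * 3600))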
Best,
Mike Sun

On Sat, May 17, 2025 at 12:54 PM Mike Sun <m...@msun.io> wrote:

>> Jeremiah, you're right, I've been using "repair" to mean a cluster-level repair as opposed to a single "nodetool repair" operation, and the Cassandra docs mean "nodetool repair" when referring to a repair. Thanks for pointing that out! I agree that the recommendation to run a "nodetool repair" on every node or token range every 7 days with a gc_grace_seconds of 10 days should practically prevent data resurrection.
>>
>> I still think theoretically, though, that starting and completing each nodetool repair operation within gc_grace_seconds won't absolutely guarantee that there's no chance of an expired tombstone. nodetool repair operations on the same node and token range(s) don't always take the same amount of time to run, and therefore don't guarantee that specific tokens are always repaired at the same elapsed time within each run.
>>
>> e.g. if gc_grace_seconds=10 hours, nodetool repair is run every 7 hours, and nodetool repair operations can take between 2 and 5 hours:
>>
>> - 00:00 - nodetool repair 1 starts on node A
>> - 00:30 - nodetool repair 1 repairs token T
>> - 01:00 - token T is deleted
>> - 02:00 - nodetool repair 1 completes
>> - 07:00 - nodetool repair 2 starts on node A
>> - 11:00 - tombstone for token T expires
>> - 11:30 - nodetool repair 2 repairs token T
>> - 12:00 - nodetool repair 2 completes
>>
>> In reality, I agree this is very unlikely to happen. But if we're looking to establish a rigorous requirement that prevents any chance of data resurrection, then I believe it's the invariant I proposed for "cluster-level repairs": that two consecutive complete repairs must succeed within gc_grace_seconds. Theoretical risk of data resurrection is something that keeps me up at night! :)
>>
>> More practically, in my experience with Cassandra and Scylla clusters, I think most operators reason about repairs as "cluster-level" as opposed to individual "nodetool repair" operations, especially due to the use of Reaper for Cassandra and Scylla Manager. Reaper and Scylla Manager repair jobs are cluster-level, and repair administration and monitoring is generally at the cluster level, e.g. cluster-level repair schedules, durations, and success/completions.
>>
>> Repairs managed by Reaper and Scylla Manager do not guarantee a deterministic ordering or timing of the individual nodetool repair operations they manage between separate cycles, breaking the "you are performing the cycles in the same order around the nodes every time" assumption. That's the context my original cluster-level repair example comes from.
>>
>> Thanks for the helpful discussion, I will update my blog post to reflect the helpful clarifications!
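[To make that invariant mechanical: a minimal sketch, with times in hours taken from the 10-hour example above; the repair intervals and gc_grace value are just the illustrative numbers from that example, nothing Cassandra-specific.]

    # Check the proposed invariant: for every pair of consecutive repairs of the
    # same node/token range, the span from the start of one to the completion of
    # the next must fit within gc_grace_seconds. Times below are in hours and are
    # the made-up numbers from the example above.

    GC_GRACE_HOURS = 10

    # (start, end) of each nodetool repair covering the range that owns token T
    repairs = [(0.0, 2.0), (7.0, 12.0)]

    def invariant_violations(repairs, gc_grace):
        """Yield consecutive repair pairs whose combined span exceeds gc_grace."""
        for (start_a, _), (_, end_b) in zip(repairs, repairs[1:]):
            if end_b - start_a > gc_grace:
                yield start_a, end_b

    for start, end in invariant_violations(repairs, GC_GRACE_HOURS):
        # 00:00 -> 12:00 is a 12-hour span, so a tombstone written just after the
        # first repair touched token T can expire before the second repair reaches it.
        print(f"possible resurrection window: repairs spanning {start}h to {end}h")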
>> On Fri, May 16, 2025 at 5:25 PM Jeremiah Jordan <jerem...@apache.org> wrote:
>>
>>> I agree we need to do a better job at wording this so people can understand what is happening.
>>>
>>> For your exact example here, you are actually looking at too broad of a thing. The exact requirements are not at the full cluster level, but actually at the "token range" level at which repair operates: a given token range needs to have a repair start and complete within the gc_grace sliding window.
>>>
>>> For your example of a repair cycle that takes 5 days and is started every 7 days, assuming you are performing the cycles in the same order around the nodes every time, a given node will have been repaired again within 7 days, even though the start of repair 1 to the finish of repair 2 was more than 7 days. The span from the start of "token ranges repaired on day 0" to the finish of "token ranges repaired on day 7" is less than the gc_grace window.
>>>
>>> -Jeremiah Jordan
>>>
>>> On May 16, 2025 at 2:03:00 PM, Mike Sun <m...@msun.io> wrote:
>>>
>>>> The wording is subtle and can be confusing...
>>>>
>>>> It's important to distinguish between:
>>>> 1. "You need to start and complete a repair within any gc_grace_seconds window"
>>>> 2. "You need to start and complete a repair within gc_grace_seconds"
>>>>
>>>> #1 is a sliding time window: any interval between the time a tombstone is written (tombstone_created_time) and the time it expires (tombstone_created_time + gc_grace_seconds).
>>>>
>>>> #2 is a duration bound on the repair time.
>>>>
>>>> My post is saying that to ensure the #1 requirement, you actually need to "start and complete two consecutive repairs within gc_grace_seconds".
>>>>
>>>> On Fri, May 16, 2025 at 2:49 PM Mike Sun <m...@msun.io> wrote:
>>>>
>>>>>> You need to *start and complete* a repair within any gc_grace_seconds window.
>>>>>
>>>>> Exactly this. And since "any gc_grace_seconds window" does not mean "a gc_grace_seconds window from which a repair starts"... the requirement needs to be that the duration to "start and complete" two consecutive full repairs is within gc_grace_seconds... that will ensure a repair "starts and completes" within "any gc_grace_seconds window".
>>>>>
>>>>> On Fri, May 16, 2025 at 2:43 PM Mick Semb Wever <m...@apache.org> wrote:
>>>>>
>>>>>>> e.g., assume gc_grace_seconds=10 days, a repair takes 5 days to run
>>>>>>> * Day 0: Repair 1 starts and processes token A
>>>>>>> * Day 1: Token A is deleted resulting in Tombstone A that will expire on Day 11
>>>>>>> * Day 5: Repair 1 completes
>>>>>>> * Day 7: Repair 2 starts
>>>>>>> * Day 11: Tombstone A expires without being repaired
>>>>>>> * Day 12: Repair 2 repairs Token A and completes
>>>>>>
>>>>>> You need to *start and complete* a repair within any gc_grace_seconds window. In your example no repair started and completed in the Day 1-11 window.
>>>>>>
>>>>>> We do need to word this better, thanks for pointing it out Mike.
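[A minimal sketch of the "start and complete a repair within the gc_grace window" rule discussed in this thread, using the day-based numbers from the example quoted above; whole-repair intervals are used as a conservative stand-in for the per-token-range repair times Jeremiah describes.]

    # Per-tombstone safety rule: some repair of the range that owns the data must
    # start and complete inside the window [tombstone written, tombstone written +
    # gc_grace]. Times are in days and come from the quoted example.

    GC_GRACE_DAYS = 10

    def tombstone_safely_repaired(tombstone_day, repairs, gc_grace=GC_GRACE_DAYS):
        """True if any (start, end) repair both starts and completes in the window."""
        window_end = tombstone_day + gc_grace
        return any(start >= tombstone_day and end <= window_end
                   for start, end in repairs)

    # Repair 1 runs Day 0-5, Repair 2 runs Day 7-12; Tombstone A is written on Day 1,
    # so its window is Day 1-11 and neither repair starts and completes inside it.
    print(tombstone_safely_repaired(1, [(0, 5), (7, 12)]))   # False -> resurrection risk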