Thanks everyone for your helpful feedback! I've updated my blog post to hopefully reflect these clarifications: https://msun.io/cassandra-scylla-repairs/
On Mon, May 19, 2025 at 9:27 AM Mike Sun <m...@msun.io> wrote:

>>> To simplify operations, the newly introduced in-built AutoRepair feature in Cassandra (as part of CEP-37 <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37%3A+Apache+Cassandra+Unified+Repair+Solution>) includes intelligent behavior that tracks the oldest repaired node in the cluster and prioritizes it for repair. It also emits a range of metrics to assist operators. One key metric, LongestUnrepairedSec <https://github.com/apache/cassandra/blob/trunk/doc/modules/cassandra/pages/managing/operating/metrics.adoc#automated-repair-metrics>, indicates how long it has been since the last repair for any part of the data. Operators can create an alarm on the metric if it becomes higher than *gc_grace_seconds*.
>
> This is great to hear! Thanks for pointing me to that, Jaydeep. It will definitely make it easier for operators to monitor and alarm on potential expiring-tombstone risks. I will update my post to include this upcoming feature.
>
> Best,
> Mike Sun
>
> On Sat, May 17, 2025 at 12:54 PM Mike Sun <m...@msun.io> wrote:
>
>>> Jeremiah, you're right: I've been using "repair" to mean a cluster-level repair as opposed to a single "nodetool repair" operation, whereas the Cassandra docs mean "nodetool repair" when referring to a repair. Thanks for pointing that out! I agree that the recommendation to run a "nodetool repair" on every node or token range every 7 days, with gc_grace_seconds = 10 days, should practically prevent data resurrection.
>>>
>>> I still think, though, that theoretically, starting and completing each nodetool repair operation within gc_grace_seconds won't absolutely guarantee that a tombstone can't expire before being repaired. nodetool repair operations on the same node and token range(s) don't always take the same amount of time to run, and therefore don't guarantee that specific tokens are always repaired at the same elapsed time.
>>>
>>> e.g., if gc_grace_seconds = 10 hours, nodetool repair is run every 7 hours, and nodetool repair operations can take between 2 and 5 hours:
>>>
>>> - 00:00 - nodetool repair 1 starts on node A
>>> - 00:30 - nodetool repair 1 repairs token T
>>> - 01:00 - token T is deleted
>>> - 02:00 - nodetool repair 1 completes
>>> - 07:00 - nodetool repair 2 starts on node A
>>> - 11:00 - tombstone for token T expires
>>> - 11:30 - nodetool repair 2 repairs token T
>>> - 12:00 - nodetool repair 2 completes
>>>
>>> In reality, I agree this is very unlikely to happen. But if we're looking to establish a rigorous requirement that prevents any chance of data resurrection, then I believe it's the invariant I proposed for "cluster-level repairs": two consecutive complete repairs must succeed within gc_grace_seconds. Theoretical risk of data resurrection is something that keeps me up at night! :)
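To make the timing in that example concrete, here is a minimal Python sketch (an illustration only, not code from Cassandra or from this thread); the schedule, the repair durations, and the times at which each repair reaches token T are simply the assumptions stated in the example:

    # Times are hours from 00:00, taken from the example above.
    GC_GRACE_HOURS = 10

    # (start, reaches_token_T, end) for each nodetool repair cycle on node A:
    # repair 1 takes 2 hours, repair 2 takes 5 hours.
    repairs = [
        (0.0, 0.5, 2.0),    # repair 1: starts 00:00, repairs T at 00:30, completes 02:00
        (7.0, 11.5, 12.0),  # repair 2: starts 07:00, repairs T at 11:30, completes 12:00
    ]

    delete_time = 1.0                                # token T deleted at 01:00
    tombstone_expiry = delete_time + GC_GRACE_HOURS  # tombstone purgeable from 11:00

    # The tombstone is only propagated to all replicas if some repair reaches
    # token T after the delete and before the tombstone becomes purgeable.
    propagated = any(delete_time <= reaches_t < tombstone_expiry
                     for _, reaches_t, _ in repairs)

    print(propagated)  # False: repair 1 passed T before the delete; repair 2 reaches it too late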
>>> More practically, in my experience with Cassandra and Scylla clusters, most operators reason about repairs as "cluster-level" rather than as individual "nodetool repair" operations, especially given the use of Reaper for Cassandra and of Scylla Manager. Reaper and Scylla Manager repair jobs are cluster-level, and repair administration and monitoring is generally at the cluster level too, e.g. cluster-level repair schedules, durations, and successes/completions.
>>>
>>> Repairs managed by Reaper and Scylla Manager do not guarantee a deterministic ordering or timing of the individual nodetool repair operations they manage between separate cycles, which breaks the "you are performing the cycles in the same order around the nodes every time" assumption. That's the context my original cluster-level repair example comes from.
>>>
>>> Thanks for the helpful discussion, I will update my blog post to reflect these clarifications!
>>>
>>> On Fri, May 16, 2025 at 5:25 PM Jeremiah Jordan <jerem...@apache.org> wrote:
>>>
>>>> I agree we need to do a better job of wording this so people can understand what is happening.
>>>>
>>>> For your exact example here, you are actually looking at too broad a scope. The exact requirements are not at the full cluster level, but at the "token range" level at which repair operates: a given token range needs to have repair start and complete within the gc_grace sliding window. For your example of a repair cycle that takes 5 days and is started every 7 days, assuming you are performing the cycles in the same order around the nodes every time, a given node will have been repaired within 7 days, even though the start of repair 1 to the finish of repair 2 was more than 7 days. The start of "token ranges repaired on day 0" to the finish of "token ranges repaired on day 7" is less than the gc_grace window.
>>>>
>>>> -Jeremiah Jordan
>>>>
>>>> On May 16, 2025 at 2:03:00 PM, Mike Sun <m...@msun.io> wrote:
>>>>
>>>>> The wording is subtle and can be confusing...
>>>>>
>>>>> It's important to distinguish between:
>>>>> 1. "You need to start and complete a repair within any gc_grace_seconds window"
>>>>> 2. "You need to start and complete a repair within gc_grace_seconds"
>>>>>
>>>>> #1 is a sliding time window: any interval from when a tombstone is written (tombstone_created_time) to when it expires (tombstone_created_time + gc_grace_seconds).
>>>>>
>>>>> #2 is a duration bound on the repair time itself.
>>>>>
>>>>> My post is saying that to ensure the #1 requirement, you actually need to "start and complete two consecutive repairs within gc_grace_seconds".
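As an illustration of the difference between the two conditions (a sketch of my reading of the thread, not code from Cassandra), here is a small Python check for one node/token range, reusing the 10-hour example from earlier; the repair (start, end) times are assumptions taken from that example:

    # Completed repairs of one node/token range, as (start, end) in hours:
    # repair 1 runs 00:00-02:00, repair 2 runs 07:00-12:00.
    GC_GRACE_HOURS = 10
    repairs = [(0, 2), (7, 12)]

    # Condition #2: each individual repair starts and completes within gc_grace_seconds.
    each_repair_within_grace = all(end - start <= GC_GRACE_HOURS for start, end in repairs)

    # The proposed invariant behind condition #1: the start of repair N through the
    # completion of repair N+1 must fit within gc_grace_seconds.  A tombstone written
    # just after repair N starts (so repair N may already have passed that token) is
    # then still guaranteed to be carried by repair N+1 before it becomes purgeable.
    consecutive_repairs_within_grace = all(
        repairs[i + 1][1] - repairs[i][0] <= GC_GRACE_HOURS
        for i in range(len(repairs) - 1)
    )

    print(each_repair_within_grace)          # True: 2h and 5h are both <= 10h
    print(consecutive_repairs_within_grace)  # False: 12h - 0h > 10h, so resurrection is possible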
>>>>> On Fri, May 16, 2025 at 2:49 PM Mike Sun <m...@msun.io> wrote:
>>>>>
>>>>>> > You need to *start and complete* a repair within any gc_grace_seconds window.
>>>>>>
>>>>>> Exactly this. And since "any gc_grace_seconds window" does not mean "any gc_grace window measured from when a repair starts"... the requirement needs to be that the duration to "start and complete" two consecutive full repairs is within gc_grace_seconds... that will ensure a repair "starts and completes" within "any gc_grace_seconds" window.
>>>>>>
>>>>>> On Fri, May 16, 2025 at 2:43 PM Mick Semb Wever <m...@apache.org> wrote:
>>>>>>
>>>>>>>> e.g., assume gc_grace_seconds = 10 days, and a repair takes 5 days to run:
>>>>>>>> * Day 0: Repair 1 starts and processes token A
>>>>>>>> * Day 1: Token A is deleted, resulting in Tombstone A that will expire on Day 11
>>>>>>>> * Day 5: Repair 1 completes
>>>>>>>> * Day 7: Repair 2 starts
>>>>>>>> * Day 11: Tombstone A expires without being repaired
>>>>>>>> * Day 12: Repair 2 repairs Token A and completes
>>>>>>>
>>>>>>> You need to *start and complete* a repair within any gc_grace_seconds window. In your example, no repair started and completed in the Day 1-11 window.
>>>>>>>
>>>>>>> We do need to word this better, thanks for pointing it out, Mike.
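For completeness, a quick check of that day-based example against the sliding-window rule (again only an illustrative sketch; the start/end days are the ones from the example):

    # A tombstone created on Day 1 is purgeable from Day 11 (gc_grace_seconds = 10 days).
    # The sliding-window rule: some repair must both start and complete inside Day 1-11.
    GC_GRACE_DAYS = 10
    tombstone_day = 1
    window_start, window_end = tombstone_day, tombstone_day + GC_GRACE_DAYS  # Day 1 .. Day 11

    repairs = [(0, 5), (7, 12)]  # (start_day, end_day) for Repair 1 and Repair 2

    covered = any(window_start <= start and end <= window_end for start, end in repairs)
    print(covered)  # False: Repair 1 started before Day 1, Repair 2 finished after Day 11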