Thanks everyone for your helpful feedback! I've updated my blog post to hopefully reflect these clarifications: https://msun.io/cassandra-scylla-repairs/
On Mon, May 19, 2025 at 9:27 AM Mike Sun <m...@msun.io> wrote:

>>> To simplify operations, the newly introduced in-built AutoRepair feature in Cassandra (as part of CEP-37 <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37%3A+Apache+Cassandra+Unified+Repair+Solution>) includes intelligent behavior that tracks the oldest repaired node in the cluster and prioritizes it for repair. It also emits a range of metrics to assist operators. One key metric, LongestUnrepairedSec <https://github.com/apache/cassandra/blob/trunk/doc/modules/cassandra/pages/managing/operating/metrics.adoc#automated-repair-metrics>, indicates how long it has been since the last repair for any part of the data. Operators can create an alarm on the metric if it becomes higher than *gc_grace_seconds*.
>
> This is great to hear! Thanks for pointing me to that, Jaydeep. It will definitely make it easier for operators to monitor and alarm on potential expiring-tombstone risks. I will update my post to include this upcoming feature.
>
> Best,
> Mike Sun
>
> On Sat, May 17, 2025 at 12:54 PM Mike Sun <m...@msun.io> wrote:
>
>>> Jeremiah, you're right: I've been using "repair" to mean a cluster-level repair as opposed to a single "nodetool repair" operation, whereas the Cassandra docs mean "nodetool repair" when referring to a repair. Thanks for pointing that out! I agree that the recommendation to run a "nodetool repair" on every node or token range every 7 days, with gc_grace_seconds = 10 days, should practically prevent data resurrection.
>>>
>>> I still think, though, that theoretically, starting and completing each nodetool repair operation within gc_grace_seconds won't absolutely guarantee that a tombstone can't expire before being repaired. nodetool repair operations on the same node and token range(s) don't always take the same amount of time to run, and therefore don't guarantee that specific tokens are always repaired at the same elapsed time.
>>>
>>> e.g., if gc_grace_seconds = 10 hours, nodetool repair is run every 7 hours, and nodetool repair operations can take between 2 and 5 hours:
>>>
>>> - 00:00 - nodetool repair 1 starts on node A
>>> - 00:30 - nodetool repair 1 repairs token T
>>> - 01:00 - token T is deleted
>>> - 02:00 - nodetool repair 1 completes
>>> - 07:00 - nodetool repair 2 starts on node A
>>> - 11:00 - tombstone for token T expires
>>> - 11:30 - nodetool repair 2 repairs token T
>>> - 12:00 - nodetool repair 2 completes
>>>
>>> In reality, I agree this is very unlikely to happen. But if we're looking to establish a rigorous requirement that prevents any chance of data resurrection, then I believe it's the invariant I proposed for "cluster-level repairs": two consecutive complete repairs must succeed within gc_grace_seconds. Theoretical risk of data resurrection is something that keeps me up at night! :)
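To make the timing in that example concrete, here is a minimal Python sketch (an illustration only, not code from Cassandra or from this thread); the schedule, the repair durations, and the times at which each repair reaches token T are simply the assumptions stated in the example:

    # Times are hours from 00:00, taken from the example above.
    GC_GRACE_HOURS = 10

    # (start, reaches_token_T, end) for each nodetool repair cycle on node A:
    # repair 1 takes 2 hours, repair 2 takes 5 hours.
    repairs = [
        (0.0, 0.5, 2.0),    # repair 1: starts 00:00, repairs T at 00:30, completes 02:00
        (7.0, 11.5, 12.0),  # repair 2: starts 07:00, repairs T at 11:30, completes 12:00
    ]

    delete_time = 1.0                                # token T deleted at 01:00
    tombstone_expiry = delete_time + GC_GRACE_HOURS  # tombstone purgeable from 11:00

    # The tombstone is only propagated to all replicas if some repair reaches
    # token T after the delete and before the tombstone becomes purgeable.
    propagated = any(delete_time <= reaches_t < tombstone_expiry
                     for _, reaches_t, _ in repairs)

    print(propagated)  # False: repair 1 passed T before the delete; repair 2 reaches it too late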
>>> More practically, in my experience with Cassandra and Scylla clusters, most operators reason about repairs as "cluster-level" rather than as individual "nodetool repair" operations, especially given the use of Reaper for Cassandra and of Scylla Manager. Reaper and Scylla Manager repair jobs are cluster-level, and repair administration and monitoring is generally at the cluster level too, e.g. cluster-level repair schedules, durations, and successes/completions.
>>>
>>> Repairs managed by Reaper and Scylla Manager do not guarantee a deterministic ordering or timing of the individual nodetool repair operations they manage between separate cycles, which breaks the "you are performing the cycles in the same order around the nodes every time" assumption. That's the context my original cluster-level repair example comes from.
>>>
>>> Thanks for the helpful discussion, I will update my blog post to reflect these clarifications!
>>>
>>> On Fri, May 16, 2025 at 5:25 PM Jeremiah Jordan <jerem...@apache.org> wrote:
>>>
>>>> I agree we need to do a better job of wording this so people can understand what is happening.
>>>>
>>>> For your exact example here, you are actually looking at too broad a scope. The exact requirements are not at the full cluster level, but at the "token range" level at which repair operates: a given token range needs to have repair start and complete within the gc_grace sliding window. For your example of a repair cycle that takes 5 days and is started every 7 days, assuming you are performing the cycles in the same order around the nodes every time, a given node will have been repaired within 7 days, even though the start of repair 1 to the finish of repair 2 was more than 7 days. The start of "token ranges repaired on day 0" to the finish of "token ranges repaired on day 7" is less than the gc_grace window.
>>>>
>>>> -Jeremiah Jordan
>>>>
>>>> On May 16, 2025 at 2:03:00 PM, Mike Sun <m...@msun.io> wrote:
>>>>
>>>>> The wording is subtle and can be confusing...
>>>>>
>>>>> It's important to distinguish between:
>>>>> 1. "You need to start and complete a repair within any gc_grace_seconds window"
>>>>> 2. "You need to start and complete a repair within gc_grace_seconds"
>>>>>
>>>>> #1 is a sliding time window: any interval from when a tombstone is written (tombstone_created_time) to when it expires (tombstone_created_time + gc_grace_seconds).
>>>>>
>>>>> #2 is a duration bound on the repair time itself.
>>>>>
>>>>> My post is saying that to ensure the #1 requirement, you actually need to "start and complete two consecutive repairs within gc_grace_seconds".
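As an illustration of the difference between the two conditions (a sketch of my reading of the thread, not code from Cassandra), here is a small Python check for one node/token range, reusing the 10-hour example from earlier; the repair (start, end) times are assumptions taken from that example:

    # Completed repairs of one node/token range, as (start, end) in hours:
    # repair 1 runs 00:00-02:00, repair 2 runs 07:00-12:00.
    GC_GRACE_HOURS = 10
    repairs = [(0, 2), (7, 12)]

    # Condition #2: each individual repair starts and completes within gc_grace_seconds.
    each_repair_within_grace = all(end - start <= GC_GRACE_HOURS for start, end in repairs)

    # The proposed invariant behind condition #1: the start of repair N through the
    # completion of repair N+1 must fit within gc_grace_seconds.  A tombstone written
    # just after repair N starts (so repair N may already have passed that token) is
    # then still guaranteed to be carried by repair N+1 before it becomes purgeable.
    consecutive_repairs_within_grace = all(
        repairs[i + 1][1] - repairs[i][0] <= GC_GRACE_HOURS
        for i in range(len(repairs) - 1)
    )

    print(each_repair_within_grace)          # True: 2h and 5h are both <= 10h
    print(consecutive_repairs_within_grace)  # False: 12h - 0h > 10h, so resurrection is possible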
>>>>> On Fri, May 16, 2025 at 2:49 PM Mike Sun <m...@msun.io> wrote:
>>>>>
>>>>>> > You need to *start and complete* a repair within any gc_grace_seconds window.
>>>>>>
>>>>>> Exactly this. And since "any gc_grace_seconds window" does not mean "any gc_grace window measured from when a repair starts"... the requirement needs to be that the duration to "start and complete" two consecutive full repairs is within gc_grace_seconds... that will ensure a repair "starts and completes" within "any gc_grace_seconds" window.
>>>>>>
>>>>>> On Fri, May 16, 2025 at 2:43 PM Mick Semb Wever <m...@apache.org> wrote:
>>>>>>
>>>>>>>> e.g., assume gc_grace_seconds = 10 days, and a repair takes 5 days to run:
>>>>>>>> * Day 0: Repair 1 starts and processes token A
>>>>>>>> * Day 1: Token A is deleted, resulting in Tombstone A that will expire on Day 11
>>>>>>>> * Day 5: Repair 1 completes
>>>>>>>> * Day 7: Repair 2 starts
>>>>>>>> * Day 11: Tombstone A expires without being repaired
>>>>>>>> * Day 12: Repair 2 repairs Token A and completes
>>>>>>>
>>>>>>> You need to *start and complete* a repair within any gc_grace_seconds window. In your example, no repair started and completed in the Day 1-11 window.
>>>>>>>
>>>>>>> We do need to word this better, thanks for pointing it out, Mike.
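For completeness, a quick check of that day-based example against the sliding-window rule (again only an illustrative sketch; the start/end days are the ones from the example):

    # A tombstone created on Day 1 is purgeable from Day 11 (gc_grace_seconds = 10 days).
    # The sliding-window rule: some repair must both start and complete inside Day 1-11.
    GC_GRACE_DAYS = 10
    tombstone_day = 1
    window_start, window_end = tombstone_day, tombstone_day + GC_GRACE_DAYS  # Day 1 .. Day 11

    repairs = [(0, 5), (7, 12)]  # (start_day, end_day) for Repair 1 and Repair 2

    covered = any(window_start <= start and end <= window_end for start, end in repairs)
    print(covered)  # False: Repair 1 started before Day 1, Repair 2 finished after Day 11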