>> To simplify operations, the newly introduced in-built AutoRepair feature in Cassandra (as part of CEP-37 <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37%3A+Apache+Cassandra+Unified+Repair+Solution>) includes intelligent behavior that tracks the oldest repaired node in the cluster and prioritizes it for repair. It also emits a range of metrics to assist operators. One key metric, LongestUnrepairedSec <https://github.com/apache/cassandra/blob/trunk/doc/modules/cassandra/pages/managing/operating/metrics.adoc#automated-repair-metrics>, indicates how long it has been since the last repair for any part of the data. Operators can create an alarm on the metric if it becomes higher than *gc_grace_seconds*.

This is great to hear! Thanks for pointing me to that, Jaydeep. It will definitely make it easier for operators to monitor and alarm on the risk of expiring tombstones. I will update my post to include this upcoming feature.
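As a rough sketch of the kind of alarm described above (the metric name comes from the linked docs, but the scraping mechanism, the 10-day gc_grace_seconds, and the 80% warning threshold are purely illustrative assumptions):

    # Sketch of an alarm check on the LongestUnrepairedSec metric.
    # The metric value would come from wherever you scrape Cassandra metrics
    # (JMX, a Prometheus exporter, etc.); here it is just passed in as a number.

    GC_GRACE_SECONDS = 10 * 24 * 3600   # this table's gc_grace_seconds (10 days here)
    WARN_RATIO = 0.8                    # warn before the threshold is actually crossed

    def classify_repair_lag(longest_unrepaired_sec):
        """Map the metric value to an alert level relative to gc_grace_seconds."""
        if longest_unrepaired_sec >= GC_GRACE_SECONDS:
            return "CRITICAL: data has gone unrepaired for longer than gc_grace_seconds"
        if longest_unrepaired_sec >= WARN_RATIO * GC_GRACE_SECONDS:
            return "WARNING: repair lag is approaching gc_grace_seconds"
        return "OK"

    # Example: nine days of lag against a ten-day gc_grace_seconds -> WARNING
    print(classify_repair_lag(9 * 24 * 3600))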
Best,
Mike Sun

On Sat, May 17, 2025 at 12:54 PM Mike Sun <m...@msun.io> wrote:

>> Jeremiah, you're right, I've been using "repair" to mean a cluster-level repair as opposed to a single "nodetool repair" operation, and the Cassandra docs mean "nodetool repair" when referring to a repair. Thanks for pointing that out! I agree that the recommendation to run a "nodetool repair" on every node or token range every 7 days with a gc_grace_seconds of 10 days should practically prevent data resurrection.
>>
>> I still think theoretically, though, that starting and completing each nodetool repair operation within gc_grace_seconds won't absolutely guarantee that there's no chance of an expired tombstone. nodetool repair operations on the same node and token range(s) don't always take the same amount of time to run, and therefore don't guarantee that specific tokens are always repaired at the same elapsed time within each run.
>>
>> e.g. if gc_grace_seconds=10 hours, nodetool repair is run every 7 hours, and nodetool repair operations can take between 2 and 5 hours:
>>
>> - 00:00 - nodetool repair 1 starts on node A
>> - 00:30 - nodetool repair 1 repairs token T
>> - 01:00 - token T is deleted
>> - 02:00 - nodetool repair 1 completes
>> - 07:00 - nodetool repair 2 starts on node A
>> - 11:00 - tombstone for token T expires
>> - 11:30 - nodetool repair 2 repairs token T
>> - 12:00 - nodetool repair 2 completes
>>
>> In reality, I agree this is very unlikely to happen. But if we're looking to establish a rigorous requirement that prevents any chance of data resurrection, then I believe it's the invariant I proposed for "cluster-level repairs": that two consecutive complete repairs must succeed within gc_grace_seconds. Theoretical risk of data resurrection is something that keeps me up at night! :)
>>
>> More practically, in my experience with Cassandra and Scylla clusters, I think most operators reason about repairs as "cluster-level" as opposed to individual "nodetool repair" operations, especially due to the use of Reaper for Cassandra and Scylla Manager. Reaper and Scylla Manager repair jobs are cluster-level, and repair administration and monitoring is generally at the cluster level, e.g. cluster-level repair schedules, durations, and success/completions.
>>
>> Repairs managed by Reaper and Scylla Manager do not guarantee a deterministic ordering or timing of the individual nodetool repair operations they manage between separate cycles, breaking the "you are performing the cycles in the same order around the nodes every time" assumption. That's the context my original cluster-level repair example comes from.
>>
>> Thanks for the helpful discussion, I will update my blog post to reflect the helpful clarifications!
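[To make that invariant mechanical: a minimal sketch, with times in hours taken from the 10-hour example above; the repair intervals and gc_grace value are just the illustrative numbers from that example, nothing Cassandra-specific.]

    # Check the proposed invariant: for every pair of consecutive repairs of the
    # same node/token range, the span from the start of one to the completion of
    # the next must fit within gc_grace_seconds. Times below are in hours and are
    # the made-up numbers from the example above.

    GC_GRACE_HOURS = 10

    # (start, end) of each nodetool repair covering the range that owns token T
    repairs = [(0.0, 2.0), (7.0, 12.0)]

    def invariant_violations(repairs, gc_grace):
        """Yield consecutive repair pairs whose combined span exceeds gc_grace."""
        for (start_a, _), (_, end_b) in zip(repairs, repairs[1:]):
            if end_b - start_a > gc_grace:
                yield start_a, end_b

    for start, end in invariant_violations(repairs, GC_GRACE_HOURS):
        # 00:00 -> 12:00 is a 12-hour span, so a tombstone written just after the
        # first repair touched token T can expire before the second repair reaches it.
        print(f"possible resurrection window: repairs spanning {start}h to {end}h")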
>> On Fri, May 16, 2025 at 5:25 PM Jeremiah Jordan <jerem...@apache.org> wrote:
>>
>>> I agree we need to do a better job at wording this so people can understand what is happening.
>>>
>>> For your exact example here, you are actually looking at too broad of a thing. The exact requirements are not at the full cluster level, but actually at the "token range" level at which repair operates: a given token range needs to have a repair start and complete within the gc_grace sliding window.
>>>
>>> For your example of a repair cycle that takes 5 days and is started every 7 days, assuming you are performing the cycles in the same order around the nodes every time, a given node will have been repaired again within 7 days, even though the start of repair 1 to the finish of repair 2 was more than 7 days. The span from the start of "token ranges repaired on day 0" to the finish of "token ranges repaired on day 7" is less than the gc_grace window.
>>>
>>> -Jeremiah Jordan
>>>
>>> On May 16, 2025 at 2:03:00 PM, Mike Sun <m...@msun.io> wrote:
>>>
>>>> The wording is subtle and can be confusing...
>>>>
>>>> It's important to distinguish between:
>>>> 1. "You need to start and complete a repair within any gc_grace_seconds window"
>>>> 2. "You need to start and complete a repair within gc_grace_seconds"
>>>>
>>>> #1 is a sliding time window: any interval between the time a tombstone is written (tombstone_created_time) and the time it expires (tombstone_created_time + gc_grace_seconds).
>>>>
>>>> #2 is a duration bound on the repair time.
>>>>
>>>> My post is saying that to ensure the #1 requirement, you actually need to "start and complete two consecutive repairs within gc_grace_seconds".
>>>>
>>>> On Fri, May 16, 2025 at 2:49 PM Mike Sun <m...@msun.io> wrote:
>>>>
>>>>>> You need to *start and complete* a repair within any gc_grace_seconds window.
>>>>>
>>>>> Exactly this. And since "any gc_grace_seconds window" does not mean "a gc_grace_seconds window from which a repair starts"... the requirement needs to be that the duration to "start and complete" two consecutive full repairs is within gc_grace_seconds... that will ensure a repair "starts and completes" within "any gc_grace_seconds window".
>>>>>
>>>>> On Fri, May 16, 2025 at 2:43 PM Mick Semb Wever <m...@apache.org> wrote:
>>>>>
>>>>>>> e.g., assume gc_grace_seconds=10 days, a repair takes 5 days to run
>>>>>>> * Day 0: Repair 1 starts and processes token A
>>>>>>> * Day 1: Token A is deleted resulting in Tombstone A that will expire on Day 11
>>>>>>> * Day 5: Repair 1 completes
>>>>>>> * Day 7: Repair 2 starts
>>>>>>> * Day 11: Tombstone A expires without being repaired
>>>>>>> * Day 12: Repair 2 repairs Token A and completes
>>>>>>
>>>>>> You need to *start and complete* a repair within any gc_grace_seconds window. In your example no repair started and completed in the Day 1-11 window.
>>>>>>
>>>>>> We do need to word this better, thanks for pointing it out Mike.
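[A minimal sketch of the "start and complete a repair within the gc_grace window" rule discussed in this thread, using the day-based numbers from the example quoted above; whole-repair intervals are used as a conservative stand-in for the per-token-range repair times Jeremiah describes.]

    # Per-tombstone safety rule: some repair of the range that owns the data must
    # start and complete inside the window [tombstone written, tombstone written +
    # gc_grace]. Times are in days and come from the quoted example.

    GC_GRACE_DAYS = 10

    def tombstone_safely_repaired(tombstone_day, repairs, gc_grace=GC_GRACE_DAYS):
        """True if any (start, end) repair both starts and completes in the window."""
        window_end = tombstone_day + gc_grace
        return any(start >= tombstone_day and end <= window_end
                   for start, end in repairs)

    # Repair 1 runs Day 0-5, Repair 2 runs Day 7-12; Tombstone A is written on Day 1,
    # so its window is Day 1-11 and neither repair starts and completes inside it.
    print(tombstone_safely_repaired(1, [(0, 5), (7, 12)]))   # False -> resurrection risk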