Jeremiah, you’re right, I’ve been using “repair” to mean a cluster-level
repair rather than a single “nodetool repair” operation, while the
Cassandra docs mean “nodetool repair” when they refer to a repair. Thanks
for pointing that out! I agree that the recommendation to run a “nodetool
repair” on every node or token range every 7 days, with gc_grace_seconds =
10 days, should in practice prevent data resurrection.

I still think that, theoretically, starting and completing each nodetool
repair operation within gc_grace_seconds won’t absolutely guarantee that a
tombstone can’t expire before it has been repaired. nodetool repair
operations on the same node and token range(s) don’t always take the same
amount of time to run, and therefore don’t guarantee that specific tokens
are always repaired at the same elapsed time within each run.

e.g. suppose gc_grace_seconds = 10 hours, nodetool repair is run every 7
hours, and nodetool repair operations can take anywhere from 2 to 5 hours:

   - 00:00 - nodetool repair 1 starts on node A
   - 00:30 - nodetool repair 1 repairs token T
   - 01:00 - token T is deleted
   - 02:00 - nodetool repair 1 completes
   - 07:00 - nodetool repair 2 starts on node A
   - 11:00 - tombstone for token T expires
   - 11:30 - nodetool repair 2 repairs token T
   - 12:00 - nodetool repair 2 completes
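
Here’s a rough Python sketch of that timeline as a sanity check (the
numbers, and the times at which token T happens to get repaired, are just
the illustrative assumptions from the example above, not anything
measured):

GC_GRACE = 10  # hours

# (start, time token T is repaired, end) for each nodetool repair on node A
repairs = [
    (0.0, 0.5, 2.0),    # repair 1: starts 00:00, hits T at 00:30, ends 02:00
    (7.0, 11.5, 12.0),  # repair 2: starts 07:00, hits T at 11:30, ends 12:00
]

delete_time = 1.0                          # token T deleted at 01:00
tombstone_expiry = delete_time + GC_GRACE  # tombstone for T expires at 11:00

# Each individual nodetool repair starts and completes within gc_grace...
assert all(end - start <= GC_GRACE for start, _, end in repairs)

# ...yet token T is never repaired between the delete and the tombstone
# expiry, so the tombstone can be purged before it reaches all replicas.
repaired_in_window = any(
    delete_time <= hit <= tombstone_expiry for _, hit, _ in repairs
)
print(repaired_in_window)  # False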

In reality, I agree this is very unlikely to happen. But if we’re looking
to establish a rigorous requirement that prevents any chance of data
resurrection, then I believe we need the invariant I proposed for
“cluster-level repairs”: two consecutive complete repairs (from the start
of the first to the end of the second) must succeed within
gc_grace_seconds. Theoretical risk of data resurrection is something that
keeps me up at night! :).
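
To make that concrete, here’s a minimal sketch of the invariant check I
have in mind (my proposed formulation, not anything from the docs): for
every pair of consecutive full repairs, the span from the start of repair
N to the end of repair N+1 must fit within gc_grace_seconds. The reasoning
is that for a tombstone written at time d, the last full repair to start
at or before d began at some s <= d; under the invariant the next repair
starts after d and finishes by s + gc_grace_seconds <= d +
gc_grace_seconds, so every token gets repaired inside that tombstone’s
window.

def invariant_holds(repairs, gc_grace):
    """repairs: list of (start, end) times of full repairs, sorted by start."""
    return all(
        repairs[i + 1][1] - repairs[i][0] <= gc_grace
        for i in range(len(repairs) - 1)
    )

# The hour-based schedule above fails the check: start of repair 1 (00:00)
# to end of repair 2 (12:00) is 12 hours, exceeding gc_grace = 10 hours.
print(invariant_holds([(0.0, 2.0), (7.0, 12.0)], gc_grace=10))  # False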

More practically, in my experience with Cassandra and Scylla clusters, I
think most operators reason about repairs at the cluster level rather than
as individual “nodetool repair” operations, especially given the use of
Reaper for Cassandra and Scylla Manager for Scylla. Reaper and Scylla
Manager repair jobs are cluster-level, and repair administration and
monitoring are generally done at the cluster level, e.g. cluster-level
repair schedules, durations, and success/completion tracking.

Repairs managed by Reaper and Scylla Manager do not guarantee a
deterministic ordering or timing of the individual nodetool repair
operations they run across separate cycles, which breaks the “you are
performing the cycles in the same order around the nodes every time”
assumption. That’s the context my original cluster-level repair example
comes from.

Thanks for the helpful discussion; I’ll update my blog post to reflect
these clarifications!

On Fri, May 16, 2025 at 5:25 PM Jeremiah Jordan <jerem...@apache.org> wrote:

> I agree we need to do a better job at wording this so people can
> understand what is happening.
>
> For your exact example here, you are actually looking at too broad of a
> thing.  The exact requirements are not at the full cluster level, but
> actually at the “token range” level at which repair operates, a given token
> range needs to have repair start and complete within the gc_grace sliding
> window.  For your example of a repair cycle that takes 5 days, and is
> started every 7 days, assuming you are performing the cycles in the same
> order around the nodes every time, a given node will have been repaired
> within 7 days, even though the start of repair 1 to the finish of repair 2
> was more than 7 days.  The start of “token ranges repaired on day 0” to the
> finish of “token ranges repaired on day 7” is less than the gc_grace window.
>
> -Jeremiah Jordan
>
> On May 16, 2025 at 2:03:00 PM, Mike Sun <m...@msun.io> wrote:
>
>> The wording is subtle and can be confusing...
>>
>> It's important to distinguish between:
>> 1. "You need to start and complete a repair within any gc_grace_seconds
>> window"
>> 2. "You need to start and complete a repair within gc_grace_seconds"
>>
>> #1 is a sliding time window: any time interval between when the tombstone
>> is written (tombstone_created_time) and when it expires
>> (tombstone_created_time + gc_grace_seconds)
>>
>> #2 is a duration bound for the repair time
>>
>> My post is saying that to ensure the #1 requirement, you actually need to
>> "start and complete two consecutive repairs within gc_grace_seconds"
>>
>>
>> On Fri, May 16, 2025 at 2:49 PM Mike Sun <m...@msun.io> wrote:
>>
>>> > You need to *start and complete* a repair within any gc_grace_seconds
>>> window.
>>> Exactly this. And since "any gc_grace_seconds window" does not mean "a
>>> gc_grace_seconds window measured from when a repair starts"... the
>>> requirement needs to be that the duration to "start and complete" two
>>> consecutive full repairs is within gc_grace_seconds... that will ensure a
>>> repair "starts and completes" within "any gc_grace_seconds" window
>>>
>>>
>>>
>>> On Fri, May 16, 2025 at 2:43 PM Mick Semb Wever <m...@apache.org> wrote:
>>>
>>>>
>>>>
>>>>> e.g., assume gc_grace_seconds=10 days, a repair takes 5 days to run
>>>>> * Day 0: Repair 1 starts and processes token A
>>>>> * Day 1: Token A is deleted resulting in Tombstone A that will expire
>>>>> on Day 11
>>>>> * Day 5: Repair 1 completes
>>>>> * Day 7: Repair 2 starts
>>>>> * Day 11: Tombstone A expires without being repaired
>>>>> * Day 12: Repair 2 repairs Token A and completes
>>>>>
>>>>
>>>>
>>>> You need to *start and complete* a repair within any gc_grace_seconds
>>>> window.
>>>> In your example no repair started and completed in the Day 1-11 window.
>>>>
>>>> We do need to word this better, thanks for pointing it out Mike.
>>>>
>>>
