I support including this in 5.0. This looks to me like a significant correctness and stabilization effort, very similar to other large bodies of work we merged in post freeze for testing and stabilizing 4.0.
On Tue, Sep 19, 2023, at 5:42 PM, Chris Lohfink wrote: > I absolutely love the idea of this being in 5.0, I am +1 for what it is worth > > On Tue, Sep 19, 2023 at 4:04 PM David Capwell <dcapw...@apple.com> wrote: >> To try to get repair more stable, I added optional retry logic (patch is >> still in review) to a handful of critical repair verbs. This patch is >> disabled by default but allows you to opt-in to retries so ephemeral issues >> don’t cause a repair to fail after running for a long time (assuming they >> resolve within the retry window). There are 2 protocol level changes to >> enable this: VALIDATION_RSP and SYNC_RSP now send an ACK (if the sender >> doesn’t attach a callback, these ACKs get ignored in all versions; see >> org.apache.cassandra.net.ResponseVerbHandler#doVerb and Verb.REPAIR_RSP). >> Given that we have already forked, I believe we would need to give a waiver >> to allow this patch due to this change. >> >> The patch was written on trunk, but figured back porting 5.0 would be rather >> trivial and this was brought up during the review, so floating this to a >> wider audience. >> >> If you look at the patch you will see that it is very large, but this is >> only to make testing of repair coordination easier and deterministic, the >> biggest code changes are: >> >> 1) Moving from ActiveRepairService.instance to >> ActiveRepairService.instance() (this is the main reason so many files were >> touched; this was needed so unit tests don’t load the whole world) >> 2) Repair no longer reaches into global space and instead is provided the >> subsystems needed to perform repair; this change is local to repair code >> >> Both of these changes were only for testing as they allow us to simulate 1k >> repairs in around 15 seconds with 100% deterministic execution.