I think it could be argued that not retrying messages is a bug, I am +1 on including this in 5.0.
Kind Regards, Brandon On Tue, Sep 19, 2023 at 1:16 PM David Capwell <dcapw...@apple.com> wrote: > > To try to get repair more stable, I added optional retry logic (patch is > still in review) to a handful of critical repair verbs. This patch is > disabled by default but allows you to opt-in to retries so ephemeral issues > don’t cause a repair to fail after running for a long time (assuming they > resolve within the retry window). There are 2 protocol level changes to > enable this: VALIDATION_RSP and SYNC_RSP now send an ACK (if the sender > doesn’t attach a callback, these ACKs get ignored in all versions; see > org.apache.cassandra.net.ResponseVerbHandler#doVerb and Verb.REPAIR_RSP). > Given that we have already forked, I believe we would need to give a waiver > to allow this patch due to this change. > > The patch was written on trunk, but figured back porting 5.0 would be rather > trivial and this was brought up during the review, so floating this to a > wider audience. > > If you look at the patch you will see that it is very large, but this is only > to make testing of repair coordination easier and deterministic, the biggest > code changes are: > > 1) Moving from ActiveRepairService.instance to ActiveRepairService.instance() > (this is the main reason so many files were touched; this was needed so unit > tests don’t load the whole world) > 2) Repair no longer reaches into global space and instead is provided the > subsystems needed to perform repair; this change is local to repair code > > Both of these changes were only for testing as they allow us to simulate 1k > repairs in around 15 seconds with 100% deterministic execution.