To make repair more stable, I added optional retry logic (the patch is still in review) to a handful of critical repair verbs. The patch is disabled by default, but it lets you opt in to retries so that ephemeral issues don’t cause a repair to fail after it has been running for a long time (assuming they resolve within the retry window). There are two protocol-level changes to enable this: VALIDATION_RSP and SYNC_RSP now send an ACK (if the sender doesn’t attach a callback, these ACKs are ignored in all versions; see org.apache.cassandra.net.ResponseVerbHandler#doVerb and Verb.REPAIR_RSP). Given that we have already forked, I believe we would need a waiver to allow this patch because of this change.
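For anyone who wants a concrete picture of the idea, here is a minimal sketch of opt-in retries with a bounded window. It is not the code from the patch: RetrySketch, withRetries, maxAttempts and baseDelayMillis are hypothetical names chosen for illustration, and the exponential backoff is my assumption rather than something the patch is known to do.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration only; none of these names come from the patch.
final class RetrySketch {
    /**
     * Runs the task, retrying on failure up to maxAttempts times in total.
     * The delay doubles after each failed attempt (assumed backoff policy).
     */
    public static <T> T withRetries(Callable<T> task, int maxAttempts, long baseDelayMillis) throws Exception {
        if (maxAttempts < 1)
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // e.g. send a repair message and wait for its ACK
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts)
                    break; // retry window exhausted, give up
                // back off before the next attempt: base, 2*base, 4*base, ...
                Thread.sleep(baseDelayMillis << (attempt - 1));
            }
        }
        throw last;
    }
}
```

With maxAttempts set to 1 this behaves like a single attempt with no retry, which is roughly what "disabled by default" would look like in this sketch. The ACK matters because without one the sender cannot tell a lost message from a delivered one, so it would not know when to retry.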
The patch was written on trunk, but I figured backporting it to 5.0 would be rather trivial, and this was brought up during the review, so I am floating it to a wider audience. If you look at the patch you will see that it is very large, but that is only to make testing of repair coordination easier and deterministic. The biggest code changes are:
1) Moving from ActiveRepairService.instance to ActiveRepairService.instance() (this is the main reason so many files were touched; it was needed so unit tests don’t load the whole world)
2) Repair no longer reaches into global space and instead is provided the subsystems needed to perform repair; this change is local to repair code
Both of these changes were only for testing, as they allow us to simulate 1k repairs in around 15 seconds with 100% deterministic execution.
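To illustrate the general shape of those two refactorings, here is a small sketch under my own assumptions. RepairService, RepairTask and Clock are hypothetical stand-ins, not classes from the patch, and the holder idiom is just one common way to get a lazily created singleton behind an accessor method; the patch may do it differently.

```java
// Hypothetical illustration only; these are not the classes touched by the patch.
final class LazySingletonSketch {
    /** A subsystem that repair depends on, expressed as an interface so tests can fake it. */
    public interface Clock {
        long nowMillis();
    }

    public static final class RepairService {
        private RepairService() {}

        // Holder idiom: the instance is created on the first call to instance(),
        // so tests that never call it never pay for the construction.
        private static final class Holder {
            static final RepairService INSTANCE = new RepairService();
        }

        public static RepairService instance() {
            return Holder.INSTANCE;
        }
    }

    public static final class RepairTask {
        private final Clock clock;

        // The dependency is passed in instead of being read from global state.
        public RepairTask(Clock clock) {
            this.clock = clock;
        }

        public long startedAt() {
            return clock.nowMillis();
        }
    }
}
```

A test can then build new LazySingletonSketch.RepairTask(() -> 42L) and get the same timestamp on every run, which is the kind of determinism that simulating many repairs relies on.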