[DISCUSS] Backport CASSANDRA-18816 to 5.0? Add support for repair coordinator to retry messages that timeout

David Capwell Tue, 19 Sep 2023 11:16:23 -0700

To try to get repair more stable, I added optional retry logic (patch is still 
in review) to a handful of critical repair verbs.  This patch is disabled by 
default but allows you to opt-in to retries so ephemeral issues don’t cause a 
repair to fail after running for a long time (assuming they resolve within the 
retry window). There are 2 protocol level changes to enable this: 
VALIDATION_RSP and SYNC_RSP now send an ACK (if the sender doesn’t attach a 
callback, these ACKs get ignored in all versions; see 
org.apache.cassandra.net.ResponseVerbHandler#doVerb and Verb.REPAIR_RSP).  
Given that we have already forked, I believe we would need to give a waiver to 
allow this patch due to this change.


The patch was written on trunk, but figured back porting 5.0 would be rather 
trivial and this was brought up during the review, so floating this to a wider 
audience.

If you look at the patch you will see that it is very large, but this is only 
to make testing of repair coordination easier and deterministic, the biggest 
code changes are:

1) Moving from ActiveRepairService.instance to ActiveRepairService.instance() 
(this is the main reason so many files were touched; this was needed so unit 
tests don’t load the whole world)
2) Repair no longer reaches into global space and instead is provided the 
subsystems needed to perform repair; this change is local to repair code

Both of these changes were only for testing as they allow us to simulate 1k 
repairs in around 15 seconds with 100% deterministic execution.

[DISCUSS] Backport CASSANDRA-18816 to 5.0? Add support for repair coordinator to retry messages that timeout

Reply via email to