Now that 4.0 is out, I want to bring up improving repair again (earlier thread http://mail-archives.apache.org/mod_mbox/cassandra-commits/201911.mbox/%3cjira.13266448.1572997299000.99567.1572997440...@atlassian.jira%3E), specifically the following two JIRAs:

CASSANDRA-15566 - Repair coordinator can hang under some cases
CASSANDRA-15399 - Add ability to track state in repair

Right now repair has an issue if any message is lost, which leads to hung or timed-out repairs. In addition, there is very little visibility into what is going on, and it gets even harder if you want to join coordinator state with participant state.

I propose the following changes to improve our current repair subsystem:

1. New tracking system for coordinator and participants (covered by CASSANDRA-15399). This system will track progress on each instance and expose it for internal access as well as to external users.
2. Add retries to specific stages of coordination, such as prepare and validate. In order to do these retries we first need to know the state of the participant which has yet to reply; this will leverage CASSANDRA-15399 to see what's going on (has the prepare been seen? Is validation running? Did it complete?). In addition to checking the state, we will need to store the validation MerkleTree; this allows the coordinator to fetch it again if it goes missing (it can be dropped en route to the coordinator or even on the coordinator).

What is not in scope?

- Rewriting all of Repair; the idea is that specific "small" changes can fix 80% of the issues.
- Handling coordinator node failure. Being able to recover from a failed coordinator should be possible after the above work is done, so it is seen as tangential and can be done later.
- Recovery from a downed participant. Similar to the previous bullet: with the state being tracked, this acts as a kind of checkpoint, so future work can come in to handle recovery.
- Handling "too large" ranges. Ideally we should add an ability to split the coordination into sub-repairs, but this is not the goal of this work.
- Overstreaming. This is a byproduct of the previous "not in scope" bullet and/or large partitions, so it is tangential to this work.

Wanted to share here before starting this work again; let me know if there are any concerns or feedback!
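
To make item 2 a bit more concrete, here is a rough Java sketch of the decision the coordinator could make once a participant has failed to reply in time. Everything in it is hypothetical: the ParticipantState and Action enums and the nextAction and coordinate methods are made-up names for illustration only, not the CASSANDRA-15399 state model or any existing Cassandra API. The participant is simulated with a plain Supplier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class RepairRetrySketch {
    // Hypothetical participant-side states; not the actual CASSANDRA-15399 model.
    enum ParticipantState { UNKNOWN, PREPARE_SEEN, VALIDATION_RUNNING, VALIDATION_COMPLETE }

    // What the coordinator does after a timeout, given the participant's state.
    enum Action { RESEND_REQUEST, KEEP_WAITING, FETCH_STORED_TREE }

    static Action nextAction(ParticipantState state) {
        switch (state) {
            case UNKNOWN:
            case PREPARE_SEEN:
                // The validation request apparently never reached the participant.
                return Action.RESEND_REQUEST;
            case VALIDATION_RUNNING:
                // Work is in progress; resending would only duplicate it.
                return Action.KEEP_WAITING;
            case VALIDATION_COMPLETE:
                // The reply was lost; the stored MerkleTree can be fetched again.
                return Action.FETCH_STORED_TREE;
            default:
                throw new IllegalArgumentException("unexpected state " + state);
        }
    }

    // Queries the participant's state up to maxAttempts times and records each decision.
    static List<Action> coordinate(Supplier<ParticipantState> stateQuery, int maxAttempts) {
        List<Action> actions = new ArrayList<>();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Action action = nextAction(stateQuery.get());
            actions.add(action);
            if (action == Action.FETCH_STORED_TREE) {
                break; // result recovered, no further retries needed
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        // Simulated participant: request lost, then running, then complete.
        ParticipantState[] observed = {
            ParticipantState.PREPARE_SEEN,
            ParticipantState.VALIDATION_RUNNING,
            ParticipantState.VALIDATION_COMPLETE
        };
        int[] index = {0};
        List<Action> actions =
            coordinate(() -> observed[Math.min(index[0]++, observed.length - 1)], 5);
        System.out.println(actions); // prints [RESEND_REQUEST, KEEP_WAITING, FETCH_STORED_TREE]
    }
}
```

The point of the sketch is only that the retry decision depends on the tracked state: a blind resend is safe when the request was never seen, wasteful while validation is running, and unnecessary once the tree has been stored. Backoff, timeouts and the actual messaging are left out on purpose.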