Now that 4.0 is out, I want to bring up improving repair again (earlier
thread
http://mail-archives.apache.org/mod_mbox/cassandra-commits/201911.mbox/%3cjira.13266448.1572997299000.99567.1572997440...@atlassian.jira%3E),
specifically the following two JIRAs:


CASSANDRA-15566 - Repair coordinator can hang under some cases

CASSANDRA-15399 - Add ability to track state in repair


Right now, repair can hang or time out if any message is lost. There is
also very little visibility into what a repair is doing, and it gets even
harder when you want to correlate coordinator state with participant
state.


I propose the following changes to improve our current repair subsystem:



   1. New tracking system for coordinator and participants (covered by
   CASSANDRA-15399).  This system will track progress on each instance and
   expose it both internally and to external users.
   2. Add retries to specific stages of coordination, such as prepare and
   validate.  To retry, we first need to know the state of the participant
   that has yet to reply; this will leverage CASSANDRA-15399 to see what's
   going on (has the prepare been seen?  Is validation running?  Did it
   complete?).  In addition to checking the state, we will need to store
   the validation MerkleTree so the coordinator can fetch it again if it
   goes missing (it can be dropped en route to the coordinator, or even on
   the coordinator).
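
As a rough illustration of the retry decision in item 2 (this is not
Cassandra code; the state names, action names, and the max_attempts
parameter are all hypothetical and only loosely mirror the questions
above), the coordinator's choice when a validation reply is overdue could
look something like this:

```python
from enum import Enum, auto


class ParticipantState(Enum):
    """Hypothetical states a participant might report for a validation."""
    UNKNOWN = auto()    # participant never saw the request (message lost)
    RUNNING = auto()    # validation is still in progress
    COMPLETED = auto()  # validation finished and the tree is stored
    FAILED = auto()     # validation failed on the participant


class Action(Enum):
    """Hypothetical actions the coordinator can take for an overdue reply."""
    RESEND_REQUEST = auto()
    KEEP_WAITING = auto()
    FETCH_STORED_TREE = auto()
    FAIL_REPAIR = auto()


def next_action(state, attempts, max_attempts=3):
    """Pick the coordinator's next step from the participant's tracked state.

    `attempts` is how many times the request has already been sent.
    """
    if state is ParticipantState.COMPLETED:
        # The reply was lost, but the stored tree can be fetched again.
        return Action.FETCH_STORED_TREE
    if state is ParticipantState.RUNNING:
        # The request arrived; resending would only duplicate work.
        return Action.KEEP_WAITING
    if state is ParticipantState.FAILED:
        return Action.FAIL_REPAIR
    # UNKNOWN: the request itself was lost, so resend within the retry budget.
    if attempts < max_attempts:
        return Action.RESEND_REQUEST
    return Action.FAIL_REPAIR
```

The point of the sketch is that a retry is only safe once the
participant's state is known: a blind resend while validation is running
would start duplicate work, while a completed validation only needs its
stored tree fetched.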


What is not in scope?

   - Rewriting all of Repair; the idea is that specific "small" changes can
   fix 80% of the issues
   - Handling coordinator node failure.  Recovering from a failed
   coordinator should be possible after the above work is done, so it is
   seen as tangential and can be done later
   - Recovery from a downed participant.  Similar to the previous bullet,
   the tracked state acts as a kind of checkpoint, so future work can build
   on it to handle recovery
   - Handling "too large" ranges.  Ideally we should add the ability to
   split the coordination into sub-repairs, but that is not the goal of
   this work
   - Overstreaming.  This is a byproduct of the previous "not in scope"
   bullet and/or large partitions, so it is tangential to this work


Wanted to share here before starting this work again; let me know if there
are any concerns or feedback!
