+1 from me, any improvement in this area would be great.

It would be nice if this could include visibility into repair streams, but just 
exposing the repair state will be a big improvement.

> On Aug 25, 2021, at 5:46 PM, David Capwell <dcapw...@gmail.com> wrote:
> 
> Now that 4.0 is out, I want to bring up improving repair again (earlier
> thread
> http://mail-archives.apache.org/mod_mbox/cassandra-commits/201911.mbox/%3cjira.13266448.1572997299000.99567.1572997440...@atlassian.jira%3E),
> specifically the following two JIRAs:
> 
> 
> CASSANDRA-15566 - Repair coordinator can hang under some cases
> 
> CASSANDRA-15399 - Add ability to track state in repair
> 
> 
> Right now repair hangs or times out if any message is lost; in addition,
> there is very little visibility into what is going on, and debugging gets
> even harder if you wish to correlate coordinator state with participant
> state.
> 
> 
> I propose the following changes to improve our current repair subsystem:
> 
> 
> 
>   1. New tracking system for coordinator and participants (covered by
>   CASSANDRA-15399).  This system will track progress on each instance and
>   expose that information both internally and to external users
>   2. Add retries to specific stages of coordination, such as prepare and
>   validate.  To retry safely we first need to know the state of the
>   participant that has yet to reply; this will leverage CASSANDRA-15399 to
>   see what's going on (has the prepare been seen?  Is validation running?
>   Did it complete?).  In addition to checking the state, we will need to
>   store the validation MerkleTree, which allows the coordinator to fetch it
>   if it goes missing (it can be dropped en route to the coordinator, or
>   even on the coordinator itself).
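The retry logic in item 2 could be sketched roughly as below. This is a hypothetical illustration, not actual Cassandra code: the names (ParticipantState, RetryDecision, decide) and the particular set of states are assumptions made up for this sketch, standing in for whatever CASSANDRA-15399 ends up exposing.

```java
// Hypothetical sketch of coordinator-side retry decisions; all names are
// illustrative, not real Cassandra APIs.
public class RepairRetrySketch {

    // Coarse per-participant repair phases a tracking system might expose.
    enum ParticipantState { UNKNOWN, PREPARED, VALIDATING, VALIDATION_COMPLETE, SYNCING, DONE, FAILED }

    enum RetryDecision { RESEND_PREPARE, WAIT, FETCH_MERKLE_TREE, PROCEED, ABORT }

    // Given the observed participant state, and whether the coordinator already
    // holds that participant's MerkleTree, decide the next action instead of
    // hanging forever on a lost message.
    static RetryDecision decide(ParticipantState state, boolean haveMerkleTree) {
        switch (state) {
            case UNKNOWN:
                return RetryDecision.RESEND_PREPARE;       // prepare message was likely lost
            case PREPARED:
            case VALIDATING:
            case SYNCING:
                return RetryDecision.WAIT;                 // participant is still working
            case VALIDATION_COMPLETE:
                return haveMerkleTree
                        ? RetryDecision.PROCEED
                        : RetryDecision.FETCH_MERKLE_TREE; // tree was dropped en route; re-fetch it
            case DONE:
                return RetryDecision.PROCEED;
            default:
                return RetryDecision.ABORT;                // participant reported failure
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(ParticipantState.UNKNOWN, false));             // RESEND_PREPARE
        System.out.println(decide(ParticipantState.VALIDATION_COMPLETE, false)); // FETCH_MERKLE_TREE
        System.out.println(decide(ParticipantState.VALIDATION_COMPLETE, true));  // PROCEED
    }
}
```

The key design point the sketch illustrates: with durable per-participant state plus a stored MerkleTree, a lost message becomes a recoverable condition (resend or re-fetch) rather than a hung repair.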
> 
> 
> What is not in scope?
> 
>   - Rewriting all of Repair; the idea is that specific "small" changes can
>   fix 80% of the issues
>   - Handling coordinator node failure.  Recovering from a failed
>   coordinator should be possible after the above work is done, so it is
>   seen as tangential and can be done later
>   - Recovery from a downed participant.  Similar to the previous bullet,
>   with the state being tracked this acts as a kind of checkpoint, so future
>   work can come in to handle recovery
>   - Handling "too large" range. Ideally we should add an ability to split
>   the coordination into sub repairs, but this is not the goal of this work.
>   - Overstreaming.  This is a byproduct of the previous "not in scope"
>   bullet and/or large partitions, so it is tangential to this work
> 
> 
> I wanted to share this here before starting the work again; let me know if
> there are any concerns or feedback!


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org
