+1 from me, any improvement in this area would be great. It would be nice if this could include visibility into repair streams, but just exposing the repair state will be a big improvement.
> On Aug 25, 2021, at 5:46 PM, David Capwell <dcapw...@gmail.com> wrote:
>
> Now that 4.0 is out, I want to bring up improving repair again (earlier
> thread
> http://mail-archives.apache.org/mod_mbox/cassandra-commits/201911.mbox/%3cjira.13266448.1572997299000.99567.1572997440...@atlassian.jira%3E),
> specifically the following two JIRAs:
>
> CASSANDRA-15566 - Repair coordinator can hang under some cases
> CASSANDRA-15399 - Add ability to track state in repair
>
> Right now repair has an issue if any message is lost, which leads to hung
> or timed-out repairs; in addition, there is a large lack of visibility into
> what is going on, and it can be even harder if you wish to join coordinator
> state with participant state.
>
> I propose the following changes to improve our current repair subsystem:
>
> 1. A new tracking system for coordinator and participants (covered by
>    CASSANDRA-15399). This system will expose progress on each instance and
>    make this information available internally as well as to external users.
> 2. Add retries to specific stages of coordination, such as prepare and
>    validate. In order to do these retries we first need to know what the
>    state is for the participant which has yet to reply; this will leverage
>    CASSANDRA-15399 to see what's going on (has the prepare been seen? Is
>    validation running? Did it complete?). In addition to checking the
>    state, we will need to store the validation MerkleTree; this allows the
>    coordinator to fetch it if it goes missing (it can be dropped en route to
>    the coordinator, or even on the coordinator itself).
>
> What is not in scope?
>
> - Rewriting all of repair; the idea is that specific "small" changes can
>   fix 80% of the issues.
> - Handling coordinator node failure. Being able to recover from a failed
>   coordinator should be possible after the above work is done, so it is
>   seen as tangential and can be done later.
> - Recovery from a downed participant.
>   Similar to the previous bullet: with the state being tracked, this acts
>   as a kind of checkpoint, so future work can come in to handle recovery.
> - Handling "too large" ranges. Ideally we should add the ability to split
>   the coordination into sub-repairs, but this is not the goal of this work.
> - Overstreaming. This is a byproduct of the previous "not in scope" bullet
>   and/or large partitions, so it is tangential to this work.
>
> Wanted to share here before starting this work again; let me know if there
> are any concerns or feedback!

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org
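To make point 2 of the proposal concrete, here is a minimal sketch of how a coordinator could use tracked participant state to decide between resending a message, waiting, and re-fetching a MerkleTree instead of hanging. All names here (ParticipantState, nextAction, etc.) are hypothetical illustrations, not actual Cassandra APIs or the design the JIRAs will land on.

```java
import java.util.EnumSet;

public class RepairRetrySketch
{
    // Coarse, hypothetical states the coordinator could query for a
    // participant that has not replied (the visibility CASSANDRA-15399
    // proposes to add).
    enum ParticipantState
    {
        UNKNOWN,             // prepare message may have been lost
        PREPARED,            // prepare seen, validation not yet started
        VALIDATING,          // validation in progress
        VALIDATION_COMPLETE  // MerkleTree built and stored; can be re-fetched
    }

    enum Action { RESEND_PREPARE, WAIT, FETCH_MERKLE_TREE }

    // Decide what the coordinator should do next, rather than timing out
    // or hanging as described in CASSANDRA-15566.
    static Action nextAction(ParticipantState state)
    {
        switch (state)
        {
            case UNKNOWN:             return Action.RESEND_PREPARE;
            case PREPARED:
            case VALIDATING:          return Action.WAIT;
            case VALIDATION_COMPLETE: return Action.FETCH_MERKLE_TREE;
            default: throw new AssertionError("unhandled state: " + state);
        }
    }

    public static void main(String[] args)
    {
        for (ParticipantState s : EnumSet.allOf(ParticipantState.class))
            System.out.println(s + " -> " + nextAction(s));
    }
}
```

The key idea is that retries become safe decisions driven by observed state, and storing the validation MerkleTree on the participant makes FETCH_MERKLE_TREE possible when the original reply was dropped.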