> > 2. Add retries to specific stages of coordination, such as prepare and
> > validate. In order to do these retries we first need to know what the
> > state is for the participant which has yet to reply...

If I understand correctly, does this mean retries only happen in the
coordinator, and the coordinator pulls the states of the participants
periodically? If the handling of requests in the participant is made
idempotent (which I think is required for retry anyway), pulling the state
is unnecessary. For example, the coordinator can just send the
PrepareRequest at regular intervals until it receives the PrepareResponse.

- Yifan

On Thu, Aug 26, 2021 at 8:56 AM Blake Eggleston <beggles...@apple.com.invalid> wrote:

> +1 from me, any improvement in this area would be great.
>
> It would be nice if this could include visibility into repair streams, but
> just exposing the repair state will be a big improvement.
>
> > On Aug 25, 2021, at 5:46 PM, David Capwell <dcapw...@gmail.com> wrote:
> >
> > Now that 4.0 is out, I want to bring up improving repair again (earlier
> > thread
> > http://mail-archives.apache.org/mod_mbox/cassandra-commits/201911.mbox/%3cjira.13266448.1572997299000.99567.1572997440...@atlassian.jira%3E
> > ), specifically the following two JIRAs:
> >
> > CASSANDRA-15566 - Repair coordinator can hang under some cases
> >
> > CASSANDRA-15399 - Add ability to track state in repair
> >
> > Right now repair has an issue if any message is lost, which leads to hung
> > or timed out repairs; in addition there is a large lack of visibility into
> > what is going on, and it can be even harder if you wish to join coordinator
> > with participant state.
> >
> > I propose the following changes to improve our current repair subsystem:
> >
> >    1. New tracking system for coordinator and participants (covered by
> >    CASSANDRA-15399). This system will expose progress on each instance and
> >    expose this information for internal access as well as external users
> >    2. Add retries to specific stages of coordination, such as prepare and
> >    validate.
> >    In order to do these retries we first need to know what the
> >    state is for the participant which has yet to reply; this will leverage
> >    CASSANDRA-15399 to see what's going on (has the prepare been seen? Is
> >    validation running? Did it complete?). In addition to checking the
> >    state, we will need to store the validation MerkleTree; this allows the
> >    coordinator to fetch it if it goes missing (it can be dropped en route to
> >    the coordinator or even on the coordinator).
> >
> > What is not in scope?
> >
> >    - Rewriting all of Repair; the idea is specific "small" changes can fix
> >    80% of the issues
> >    - Handle coordinator node failure. Being able to recover from a failed
> >    coordinator should be possible after the above work is done, so is seen as
> >    tangential and can be done later
> >    - Recovery from a downed participant. Similar to the previous bullet,
> >    with the state being tracked this acts as a kind of checkpoint, so future
> >    work can come in to handle recovery
> >    - Handling "too large" range. Ideally we should add an ability to split
> >    the coordination into sub repairs, but this is not the goal of this work.
> >    - Overstreaming. This is a byproduct of the previous "not in scope"
> >    bullet, and/or large partitions; so is tangential to this work
> >
> > Wanted to share here before starting this work again; let me know if there
> > are any concerns or feedback!
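
For illustration only, here is a minimal sketch of the idempotent-retry idea raised in the reply: the coordinator resends a PrepareRequest at regular intervals until a PrepareResponse arrives, and the participant treats duplicates as no-ops. This is not Cassandra code; `Participant`, `LossyLink`, `prepare_with_retries`, and all parameters are hypothetical names, and the "network" is a scripted in-memory stand-in.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical participant with an idempotent prepare handler."""
    prepared_sessions: set = field(default_factory=set)
    prepare_runs: int = 0  # counts how often real prepare work was done

    def handle_prepare(self, session_id):
        # Do the work only the first time a session is seen; a retried
        # (duplicate) request just gets the same response again.
        if session_id not in self.prepared_sessions:
            self.prepared_sessions.add(session_id)
            self.prepare_runs += 1
        return f"PrepareResponse({session_id})"


class LossyLink:
    """Hypothetical transport that drops messages per a fixed script."""

    def __init__(self, participant, drop_request=(), drop_response=()):
        self.participant = participant
        self.drop_request = list(drop_request)    # per attempt: lose request?
        self.drop_response = list(drop_response)  # per attempt: lose response?
        self.attempt = 0

    def send_prepare(self, session_id):
        i = self.attempt
        self.attempt += 1
        if i < len(self.drop_request) and self.drop_request[i]:
            return None  # request never reached the participant
        response = self.participant.handle_prepare(session_id)
        if i < len(self.drop_response) and self.drop_response[i]:
            return None  # participant did the work, but the reply was lost
        return response


def prepare_with_retries(link, session_id, max_attempts=5,
                         interval_seconds=0.0, sleep=time.sleep):
    """Resend the prepare request until a response arrives or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        response = link.send_prepare(session_id)
        if response is not None:
            return response, attempt
        sleep(interval_seconds)  # wait before resending
    raise TimeoutError(f"no PrepareResponse for {session_id} "
                       f"after {max_attempts} attempts")
```

In this sketch the coordinator never asks the participant for its state; it relies only on the handler being safe to call more than once. Whether that is sufficient for longer-running stages such as validation is a separate question that the sketch does not address.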