Cool, moving this from the dev list to JIRA. I will start breaking down tasks and document my progress there:
https://issues.apache.org/jira/browse/CASSANDRA-16909

> On Aug 27, 2021, at 1:21 PM, David Capwell <dcapw...@apple.com.INVALID> wrote:
>
> Push vs pull isn’t too critical, but there is one edge case to consider: if
> we did not realize the participant had restarted, triggering validation again
> (which may have been what caused the process to end) could be a problem.
>
>> On Aug 26, 2021, at 9:50 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>>
>>> 2. Add retries to specific stages of coordination, such as prepare and
>>> validate. In order to do these retries we first need to know what the
>>> state is for the participant which has yet to reply...
>>
>> If I understand it correctly, does it mean retries only happen in the
>> coordinator and the coordinator pulls the states of the participants
>> periodically?
>> If the handling of the requests in the participant is made to be idempotent
>> (which I think is required for retry anyway), pulling the state is
>> unnecessary. For example, the coordinator can just send the PrepareRequest
>> at regular intervals until it receives the PrepareResponse.
>>
>> - Yifan
>>
>> On Thu, Aug 26, 2021 at 8:56 AM Blake Eggleston
>> <beggles...@apple.com.invalid> wrote:
>>
>>> +1 from me, any improvement in this area would be great.
>>>
>>> It would be nice if this could include visibility into repair streams, but
>>> just exposing the repair state will be a big improvement.
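
Yifan's suggestion above (resend the PrepareRequest at intervals until a PrepareResponse arrives, relying on an idempotent handler) can be sketched roughly as follows. This is a toy simulation, not Cassandra code: `Participant`, `LossyChannel`, `prepare_with_retry`, and the scripted message drops are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical participant whose prepare handler is idempotent."""
    prepared: set = field(default_factory=set)
    prepare_runs: int = 0  # how many times the real prepare work was done

    def handle_prepare(self, session_id: str) -> str:
        # Do the work only the first time a session is seen; duplicates get
        # the same answer back without repeating side effects.
        if session_id not in self.prepared:
            self.prepared.add(session_id)
            self.prepare_runs += 1
        return "PREPARED"


class LossyChannel:
    """Drops messages according to a fixed script so the run is deterministic."""

    def __init__(self, participant, drop_script):
        self.participant = participant
        self.drop_script = list(drop_script)  # e.g. ["request", "response", None]

    def send_prepare(self, session_id):
        drop = self.drop_script.pop(0) if self.drop_script else None
        if drop == "request":
            return None  # request lost before reaching the participant
        response = self.participant.handle_prepare(session_id)
        if drop == "response":
            return None  # participant did the work, but the reply was lost
        return response


def prepare_with_retry(channel, session_id, max_attempts=5, wait=lambda: None):
    """Resend the prepare until a response arrives or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        response = channel.send_prepare(session_id)
        if response is not None:
            return attempt, response
        wait()  # stand-in for the retry interval
    raise TimeoutError(f"no prepare response after {max_attempts} attempts")


participant = Participant()
channel = LossyChannel(participant, ["request", "response", None])
attempts, response = prepare_with_retry(channel, "session-1")
print(attempts, response, participant.prepare_runs)  # 3 PREPARED 1
```

The first request and the second response are lost, so the coordinator needs three attempts, yet the prepare work runs only once. Note that the `prepared` set here lives in memory, so this sketch does not address the restart edge case raised in the reply above.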
>>>
>>>> On Aug 25, 2021, at 5:46 PM, David Capwell <dcapw...@gmail.com> wrote:
>>>>
>>>> Now that 4.0 is out, I want to bring up improving repair again (earlier
>>>> thread
>>>> http://mail-archives.apache.org/mod_mbox/cassandra-commits/201911.mbox/%3cjira.13266448.1572997299000.99567.1572997440...@atlassian.jira%3E
>>>> ), specifically the following two JIRAs:
>>>>
>>>> CASSANDRA-15566 - Repair coordinator can hang under some cases
>>>> CASSANDRA-15399 - Add ability to track state in repair
>>>>
>>>> Right now repair has an issue if any message is lost, which leads to hung
>>>> or timed out repairs; in addition there is a serious lack of visibility
>>>> into what is going on, and it gets even harder if you wish to join
>>>> coordinator state with participant state.
>>>>
>>>> I propose the following changes to improve our current repair subsystem:
>>>>
>>>> 1. New tracking system for coordinator and participants (covered by
>>>>    CASSANDRA-15399). This system will track progress on each instance and
>>>>    expose this information for internal access as well as external users
>>>> 2. Add retries to specific stages of coordination, such as prepare and
>>>>    validate. In order to do these retries we first need to know what the
>>>>    state is for the participant which has yet to reply; this will leverage
>>>>    CASSANDRA-15399 to see what's going on (has the prepare been seen? Is
>>>>    validation running? Did it complete?). In addition to checking the
>>>>    state, we will need to store the validation MerkleTree; this allows the
>>>>    coordinator to fetch it again if it goes missing (it can be dropped en
>>>>    route to the coordinator or even on the coordinator).
>>>>
>>>> What is not in scope?
>>>>
>>>> - Rewriting all of Repair; the idea is specific "small" changes can fix
>>>>   80% of the issues
>>>> - Handle coordinator node failure.
>>>>   Being able to recover from a failed
>>>>   coordinator should be possible after the above work is done, so is seen
>>>>   as tangential and can be done later
>>>> - Recovery from a downed participant. Similar to the previous bullet,
>>>>   with the state being tracked this acts as a kind of checkpoint, so future
>>>>   work can come in to handle recovery
>>>> - Handling "too large" range. Ideally we should add an ability to split
>>>>   the coordination into sub repairs, but this is not the goal of this work.
>>>> - Overstreaming. This is a byproduct of the previous "not in scope"
>>>>   bullet, and/or large partitions; so is tangential to this work
>>>>
>>>> Wanted to share here before starting this work again; let me know if there
>>>> are any concerns or feedback!
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: dev-h...@cassandra.apache.org
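
To make item 2 of the original proposal a bit more concrete, here is a rough sketch of a coordinator that checks a participant's tracked state before deciding whether to resend, wait, or fetch a retained validation result. It assumes a pull model and simulated time; `State`, `Participant`, `await_validation`, and the string standing in for the MerkleTree are hypothetical and are not taken from CASSANDRA-15399 or the Cassandra codebase.

```python
from enum import Enum


class State(Enum):
    UNKNOWN = "unknown"      # participant never saw the request
    RUNNING = "running"      # validation in progress
    COMPLETED = "completed"  # validation done, result retained


class Participant:
    """Hypothetical participant that tracks validation state per session."""

    def __init__(self, ticks_to_finish=2):
        self.state = {}
        self.trees = {}      # retained results, keyed by session id
        self.remaining = {}  # simulated work left per running session
        self.ticks_to_finish = ticks_to_finish
        self.validations_started = 0

    def get_state(self, session_id):
        return self.state.get(session_id, State.UNKNOWN)

    def start_validation(self, session_id):
        # Starting is a no-op if the session is already running or completed.
        if self.get_state(session_id) is State.UNKNOWN:
            self.state[session_id] = State.RUNNING
            self.remaining[session_id] = self.ticks_to_finish
            self.validations_started += 1

    def tick(self):
        """Advance simulated time by one step."""
        for sid in list(self.remaining):
            self.remaining[sid] -= 1
            if self.remaining[sid] == 0:
                del self.remaining[sid]
                self.state[sid] = State.COMPLETED
                self.trees[sid] = f"merkle-tree-for-{sid}"  # placeholder payload

    def fetch_tree(self, session_id):
        return self.trees[session_id]


def await_validation(participant, session_id, max_polls=10):
    """Coordinator side: choose an action from the participant's reported state."""
    actions = []
    for _ in range(max_polls):
        state = participant.get_state(session_id)
        if state is State.UNKNOWN:
            actions.append("resend")  # request was lost, send it again
            participant.start_validation(session_id)
        elif state is State.RUNNING:
            actions.append("wait")    # work in progress, do not restart it
        else:
            actions.append("fetch")   # reply was lost, pull the stored result
            return participant.fetch_tree(session_id), actions
        participant.tick()
    raise TimeoutError("validation did not complete")


# Case 1: the validation request never reached the participant.
lost_request = Participant()
tree1, actions1 = await_validation(lost_request, "session-1")
print(actions1, lost_request.validations_started)  # ['resend', 'wait', 'fetch'] 1

# Case 2: validation finished, but the response was dropped on the way back.
lost_response = Participant()
lost_response.start_validation("session-2")
lost_response.tick()
lost_response.tick()
tree2, actions2 = await_validation(lost_response, "session-2")
print(actions2, lost_response.validations_started)  # ['fetch'] 1
```

In both cases validation starts exactly once: the tracked state tells the coordinator whether a resend is actually needed, and the retained result covers a response lost in transit.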