awesome! @Rui Thanks for your effort! Appreciate it! Best regards, Jing
On Tue, Nov 14, 2023 at 1:32 PM Rui Fan <1996fan...@gmail.com> wrote: > Thanks a lot Zhu and Jing for the comments! > > Regarding concurrent failures mentioned by zhu, I am not familiar with it > before > and need some time to get familiar with it. So I will reply to them later. > > I will give Jing an answer first: > > > NIT: @Rui it would be great if you could point out the source code > > reference where the restartAttempt will be calculated rapidly, > > i.e. +100 with your one task example. All related docs I could find > didn't > > describe it clearly. > > I added a new part to explain why the restartAttempt is increased rapidly > in FLIP[1]. > And give an example to reproduce it. > > If all tasks of one job throw an exception directly and have the following > options, > we expect that this job can be retried 5 times with delay 10s each time, > and the > Flink job will fail after 50s: > - restart-strategy : fixed-delay > - restart-strategy.fixed-delay.attempts : 5 > - restart-strategy.fixed-delay.delay : 10 s > > Here is result for 3 cases: > - case1 is default(streaming job with region failover): Job execution > failed immediately (It's unexpected) > - case2 is streaming job with full failover: Job execution failed after 50s > (5 * 10s) (it's expected) > - case3 is batch job with region failover: Job execution failed immediately > (It's unexpected) > > And case1 and case3 are unexpected. > Note: The demo code can be found in FLIP, it's very simple. > > Why is this happening? Let me briefly summarize the reasons: > > When one task fails, it will call RestartBackoffTimeStrategy#notifyFailure, > and FixedDelayRestartBackoffTimeStrategy will call currentRestartAttempt++. > > All tasks are one region: > When the first task fails, JobMaster will mark the rest of the tasks as > ExecutionState#CANCELING. > > When rest tasks report exceptions, they will switch state from CANCELING to > CANCELED instead of FAILED. So rest tasks won't call failed logic > (Don't increase the currentRestartAttempt). > > Why doesn't region-failover work well? > > If one job has 100 regions, and every region just has one task. If one task > fails, > JobMaster doesn't update other tasks. If rest tasks report exceptions, they > will > increase the currentRestartAttempt. > > Detailed call stacks and code paths can be found in FLIP. > > Please let me know if my understanding is wrong or if it's not clear, > thanks~ > > [1] https://cwiki.apache.org/confluence/x/uJqzDw > > Best, > Rui > > On Tue, Nov 14, 2023 at 5:10 AM Jing Ge <j...@ververica.com.invalid> > wrote: > > > Hi Rui, > > > > Thanks for the proposal! I agree with Zhu that any changes of the default > > behaviors will have impact on users' jobs in the production environment > and > > it would be necessary to have users' attention to to avoid > > any surprises after upgrading Flink. > > > > @Zhu > > for 1, if we change the default values, we will not only change the > > behavior of the default restart-strategy but also change the default > > behavior of exponential-delay strategy for users who are explicitly > > configured to use the strategy with default values in the production. > > > > NIT: @Rui it would be great if you could point out the source code > > reference where the restartAttempt will be calculated rapidly, > > i.e. +100 with your one task example. All related docs I could find > didn't > > describe it clearly. > > > > Best regards, > > Jing > > > > > > On Mon, Nov 13, 2023 at 11:55 AM Zhu Zhu <reed...@gmail.com> wrote: > > > > > Hi Rui, > > > Thanks for creating this FLIP and sorry for jumping in so late into the > > > discussion. > > > > > > The improvements of exponential-delay strategy and making it the > default > > > strategy looks good it me in general. I have some comments for it, as > > well > > > as for the failure counting. > > > > > > 1. default values of exponential-delay configuration > > > It is mentioned in the FLIP that "the default value of these options > are > > > not changed, we just change them for default restart-strategy. The > > default > > > restart-strategy just takes effect if checkpointing is enabled and the > > > user doesn’t configure the restart-strategy". I'm a bit concerned with > > > the inconsistency which may cause confusion to users. So if the > proposed > > > configuration values work better in most cases, I'm leaning towards to > > > change the default values. > > > > > > 2. the default tolerable failure count > > > Currently, the restart-strategy behaves like this by default(if not > > > configured): > > > 1. job retries on failures indefinitely if checkpointing is enabled > > > 2. job fails on any failure if checkpointing is disabled > > > Changing it to always restart indefinitely may result in unexpected > > > behaviors in production. Therefore, either we should do the same > > > thing to exponential-delay, or have an open discussion(also involving > > > the user ML) on changing this default behavior. > > > > > > 3. failure counting > > > Flink currently will try to recognize concurrent failures and group > them > > > together, which can be seen in the web UI. So how about to align the > > > failure counting with the concurrent failures computing? This can make > it > > > more consistent and easier for understanding. It will require changes > to > > > the concurrent failures computing though, i.e. taking the backoff time > > > into consideration. So maybe we can open a seperate FLIP for this > change. > > > > > > Thanks, > > > Zhu > > > > > > Rui Fan <1996fan...@gmail.com> 于2023年11月10日周五 18:22写道: > > > > > > > I'll start voting next Monday if there isn't any other comment. > > > > > > > > Best, > > > > Rui > > > > > > > > On Thu, Oct 19, 2023 at 6:59 PM Rui Fan <1996fan...@gmail.com> > wrote: > > > > > > > > > Hi Konstantin and Max, > > > > > > > > > > Thanks for your feedback! > > > > > > > > > > Sorry, I forgot to mention the default value of > > > > > > > `restart-strategy.exponential-delay.max-attempts-before-reset-backoff`. > > > > > > > > > > Retrying forever sounds good to me, I have added it to the FLIP: > > > > > > > > > > The default value of > > > > > > > `restart-strategy.exponential-delay.max-attempts-before-reset-backoff` > > > is > > > > > Integer.MAX_VALUE. > > > > > > > > > > Best, > > > > > Rui > > > > > > > > > > On Thu, Oct 19, 2023 at 6:29 PM Maximilian Michels <m...@apache.org > > > > > > wrote: > > > > > > > > > >> Hey Rui, > > > > >> > > > > >> +1 for making exponential backoff the default. I agree with > > Konstantin > > > > >> that retrying forever is a good default for exponential backoff > > > > >> because oftentimes the issue will resolve eventually. The purpose > of > > > > >> exponential backoff is precisely to continue to retry without > > causing > > > > >> too much load. However, I'm not against adding an optional max > > number > > > > >> of retries. > > > > >> > > > > >> -Max > > > > >> > > > > >> On Thu, Oct 19, 2023 at 11:35 AM Konstantin Knauf < > > kna...@apache.org> > > > > >> wrote: > > > > >> > > > > > >> > Hi Rui, > > > > >> > > > > > >> > Thank you for this proposal and working on this. I also agree > that > > > > >> > exponential back off makes sense as a new default in general. I > > > think > > > > >> > restarting indefinitely (no max attempts) makes sense by > default, > > > > >> though, > > > > >> > but of course allowing users to change is valuable. > > > > >> > > > > > >> > So, overall +1. > > > > >> > > > > > >> > Cheers, > > > > >> > > > > > >> > Konstantin > > > > >> > > > > > >> > Am Di., 17. Okt. 2023 um 07:11 Uhr schrieb Rui Fan < > > > > >> 1996fan...@gmail.com>: > > > > >> > > > > > >> > > Hi all, > > > > >> > > > > > > >> > > I would like to start a discussion on FLIP-364: Improve the > > > > >> > > restart-strategy[1] > > > > >> > > > > > > >> > > As we know, the restart-strategy is critical for flink jobs, > it > > > > mainly > > > > >> > > has two functions: > > > > >> > > 1. When an exception occurs in the flink job, quickly restart > > the > > > > job > > > > >> > > so that the job can return to the running state. > > > > >> > > 2. When a job cannot be recovered after frequent restarts > within > > > > >> > > a certain period of time, Flink will not retry but will fail > the > > > > job. > > > > >> > > > > > > >> > > The current restart-strategy support for function 2 has some > > > issues: > > > > >> > > 1. The exponential-delay doesn't have the max attempts > > mechanism, > > > > >> > > it means that flink will restart indefinitely even if it fails > > > > >> frequently. > > > > >> > > 2. For multi-region streaming jobs and all batch jobs, the > > failure > > > > of > > > > >> > > each region will increase the total number of job failures by > > +1, > > > > >> > > even if these failures occur at the same time. If the number > of > > > > >> > > failures increases too quickly, it will be difficult to set a > > > > >> reasonable > > > > >> > > number of retries. > > > > >> > > If the maximum number of failures is set too low, the job can > > > easily > > > > >> > > reach the retry limit, causing the job to fail. If set too > high, > > > > some > > > > >> jobs > > > > >> > > will never fail. > > > > >> > > > > > > >> > > In addition, when the above two problems are solved, we can > also > > > > >> > > discuss whether exponential-delay can replace fixed-delay as > the > > > > >> > > default restart-strategy. In theory, exponential-delay is > > smarter > > > > and > > > > >> > > friendlier than fixed-delay. > > > > >> > > > > > > >> > > I also thank Zhu Zhu for his suggestions on the option name in > > > > >> > > FLINK-32895[2] in advance. > > > > >> > > > > > > >> > > Looking forward to and welcome everyone's feedback and > > > suggestions, > > > > >> thank > > > > >> > > you. > > > > >> > > > > > > >> > > [1] https://cwiki.apache.org/confluence/x/uJqzDw > > > > >> > > [2] https://issues.apache.org/jira/browse/FLINK-32895 > > > > >> > > > > > > >> > > Best, > > > > >> > > Rui > > > > >> > > > > > > >> > > > > > >> > > > > > >> > -- > > > > >> > https://twitter.com/snntrable > > > > >> > https://github.com/knaufk > > > > >> > > > > > > > > > > > > > > >