Re: [DISCUSS] FLIP-364: Improve the restart-strategy

Jing Ge Tue, 14 Nov 2023 13:01:30 -0800

awesome! @Rui Thanks for your effort! Appreciate it!

Best regards,
Jing


On Tue, Nov 14, 2023 at 1:32 PM Rui Fan <1996fan...@gmail.com> wrote:

> Thanks a lot Zhu and Jing for the comments!
>
> Regarding concurrent failures mentioned by zhu, I am not familiar with it
> before
> and need some time to get familiar with it. So I will reply to them later.
>
> I will give Jing an answer first:
>
> > NIT: @Rui it would be great if you could point out the source code
> > reference where the restartAttempt will be calculated rapidly,
> > i.e. +100 with your one task example. All related docs I could find
> didn't
> > describe it clearly.
>
> I added a new part to explain why the restartAttempt is increased rapidly
> in FLIP[1].
> And give an example to reproduce it.
>
> If all tasks of one job throw an exception directly and have the following
> options,
> we expect that this job can be retried 5 times with delay 10s each time,
> and the
> Flink job will fail after 50s:
> - restart-strategy : fixed-delay
> - restart-strategy.fixed-delay.attempts : 5
> - restart-strategy.fixed-delay.delay : 10 s
>
> Here is result for 3 cases:
> - case1 is default(streaming job with region failover): Job execution
> failed immediately (It's unexpected)
> - case2 is streaming job with full failover: Job execution failed after 50s
> (5 * 10s) (it's expected)
> - case3 is batch job with region failover: Job execution failed immediately
> (It's unexpected)
>
> And case1 and case3 are unexpected.
> Note: The demo code can be found in FLIP, it's very simple.
>
> Why is this happening? Let me briefly summarize the reasons:
>
> When one task fails, it will call RestartBackoffTimeStrategy#notifyFailure,
> and FixedDelayRestartBackoffTimeStrategy will call currentRestartAttempt++.
>
> All tasks are one region:
> When the first task fails, JobMaster will mark the rest of the tasks as
> ExecutionState#CANCELING.
>
> When rest tasks report exceptions, they will switch state from CANCELING to
> CANCELED instead of FAILED. So rest tasks won't call failed logic
> (Don't increase the currentRestartAttempt).
>
> Why doesn't region-failover work well?
>
> If one job has 100 regions, and every region just has one task. If one task
> fails,
> JobMaster doesn't update other tasks. If rest tasks report exceptions, they
> will
> increase the currentRestartAttempt.
>
> Detailed call stacks and code paths can be found in FLIP.
>
> Please let me know if my understanding is wrong or if it's not clear,
> thanks~
>
> [1] https://cwiki.apache.org/confluence/x/uJqzDw
>
> Best,
> Rui
>
> On Tue, Nov 14, 2023 at 5:10 AM Jing Ge <j...@ververica.com.invalid>
> wrote:
>
> > Hi Rui,
> >
> > Thanks for the proposal! I agree with Zhu that any changes of the default
> > behaviors will have impact on users' jobs in the production environment
> and
> > it would be necessary to have users' attention to to avoid
> > any surprises after upgrading Flink.
> >
> > @Zhu
> > for 1, if we change the default values, we will not only change the
> > behavior of the default restart-strategy but also change the default
> > behavior of exponential-delay strategy for users who are explicitly
> > configured to use the strategy with default values in the production.
> >
> > NIT: @Rui it would be great if you could point out the source code
> > reference where the restartAttempt will be calculated rapidly,
> > i.e. +100 with your one task example. All related docs I could find
> didn't
> > describe it clearly.
> >
> > Best regards,
> > Jing
> >
> >
> > On Mon, Nov 13, 2023 at 11:55 AM Zhu Zhu <reed...@gmail.com> wrote:
> >
> > > Hi Rui,
> > > Thanks for creating this FLIP and sorry for jumping in so late into the
> > > discussion.
> > >
> > > The improvements of exponential-delay strategy and making it the
> default
> > > strategy looks good it me in general. I have some comments for it, as
> > well
> > > as for the failure counting.
> > >
> > > 1. default values of exponential-delay configuration
> > > It is mentioned in the FLIP that "the default value of these options
> are
> > > not changed, we just change them for default restart-strategy. The
> > default
> > > restart-strategy just takes effect if checkpointing is enabled and the
> > > user doesn’t configure the restart-strategy". I'm a bit concerned with
> > > the inconsistency which may cause confusion to users. So if the
> proposed
> > > configuration values work better in most cases, I'm leaning towards to
> > > change the default values.
> > >
> > > 2. the default tolerable failure count
> > > Currently, the restart-strategy behaves like this by default(if not
> > > configured):
> > > 1. job retries on failures indefinitely if checkpointing is enabled
> > > 2. job fails on any failure if checkpointing is disabled
> > > Changing it to always restart indefinitely may result in unexpected
> > > behaviors in production. Therefore, either we should do the same
> > > thing to exponential-delay, or have an open discussion(also involving
> > > the user ML) on changing this default behavior.
> > >
> > > 3. failure counting
> > > Flink currently will try to recognize concurrent failures and group
> them
> > > together, which can be seen in the web UI. So how about to align the
> > > failure counting with the concurrent failures computing? This can make
> it
> > > more consistent and easier for understanding. It will require changes
> to
> > > the concurrent failures computing though, i.e. taking the backoff time
> > > into consideration. So maybe we can open a seperate FLIP for this
> change.
> > >
> > > Thanks,
> > > Zhu
> > >
> > > Rui Fan <1996fan...@gmail.com> 于2023年11月10日周五 18:22写道：
> > >
> > > > I'll start voting next Monday if there isn't any other comment.
> > > >
> > > > Best,
> > > > Rui
> > > >
> > > > On Thu, Oct 19, 2023 at 6:59 PM Rui Fan <1996fan...@gmail.com>
> wrote:
> > > >
> > > > > Hi Konstantin and Max,
> > > > >
> > > > > Thanks for your feedback!
> > > > >
> > > > > Sorry, I forgot to mention the default value of
> > > > >
> > `restart-strategy.exponential-delay.max-attempts-before-reset-backoff`.
> > > > >
> > > > > Retrying forever sounds good to me, I have added it to the FLIP:
> > > > >
> > > > > The default value of
> > > > >
> > `restart-strategy.exponential-delay.max-attempts-before-reset-backoff`
> > > is
> > > > > Integer.MAX_VALUE.
> > > > >
> > > > > Best,
> > > > > Rui
> > > > >
> > > > > On Thu, Oct 19, 2023 at 6:29 PM Maximilian Michels <m...@apache.org
> >
> > > > wrote:
> > > > >
> > > > >> Hey Rui,
> > > > >>
> > > > >> +1 for making exponential backoff the default. I agree with
> > Konstantin
> > > > >> that retrying forever is a good default for exponential backoff
> > > > >> because oftentimes the issue will resolve eventually. The purpose
> of
> > > > >> exponential backoff is precisely to continue to retry without
> > causing
> > > > >> too much load. However, I'm not against adding an optional max
> > number
> > > > >> of retries.
> > > > >>
> > > > >> -Max
> > > > >>
> > > > >> On Thu, Oct 19, 2023 at 11:35 AM Konstantin Knauf <
> > kna...@apache.org>
> > > > >> wrote:
> > > > >> >
> > > > >> > Hi Rui,
> > > > >> >
> > > > >> > Thank you for this proposal and working on this. I also agree
> that
> > > > >> > exponential back off makes sense as a new default in general. I
> > > think
> > > > >> > restarting indefinitely (no max attempts) makes sense by
> default,
> > > > >> though,
> > > > >> > but of course allowing users to change is valuable.
> > > > >> >
> > > > >> > So, overall +1.
> > > > >> >
> > > > >> > Cheers,
> > > > >> >
> > > > >> > Konstantin
> > > > >> >
> > > > >> > Am Di., 17. Okt. 2023 um 07:11 Uhr schrieb Rui Fan <
> > > > >> 1996fan...@gmail.com>:
> > > > >> >
> > > > >> > > Hi all,
> > > > >> > >
> > > > >> > > I would like to start a discussion on FLIP-364: Improve the
> > > > >> > > restart-strategy[1]
> > > > >> > >
> > > > >> > > As we know, the restart-strategy is critical for flink jobs,
> it
> > > > mainly
> > > > >> > > has two functions:
> > > > >> > > 1. When an exception occurs in the flink job, quickly restart
> > the
> > > > job
> > > > >> > > so that the job can return to the running state.
> > > > >> > > 2. When a job cannot be recovered after frequent restarts
> within
> > > > >> > > a certain period of time, Flink will not retry but will fail
> the
> > > > job.
> > > > >> > >
> > > > >> > > The current restart-strategy support for function 2 has some
> > > issues:
> > > > >> > > 1. The exponential-delay doesn't have the max attempts
> > mechanism,
> > > > >> > > it means that flink will restart indefinitely even if it fails
> > > > >> frequently.
> > > > >> > > 2. For multi-region streaming jobs and all batch jobs, the
> > failure
> > > > of
> > > > >> > > each region will increase the total number of job failures by
> > +1,
> > > > >> > > even if these failures occur at the same time. If the number
> of
> > > > >> > > failures increases too quickly, it will be difficult to set a
> > > > >> reasonable
> > > > >> > > number of retries.
> > > > >> > > If the maximum number of failures is set too low, the job can
> > > easily
> > > > >> > > reach the retry limit, causing the job to fail. If set too
> high,
> > > > some
> > > > >> jobs
> > > > >> > > will never fail.
> > > > >> > >
> > > > >> > > In addition, when the above two problems are solved, we can
> also
> > > > >> > > discuss whether exponential-delay can replace fixed-delay as
> the
> > > > >> > > default restart-strategy. In theory, exponential-delay is
> > smarter
> > > > and
> > > > >> > > friendlier than fixed-delay.
> > > > >> > >
> > > > >> > > I also thank Zhu Zhu for his suggestions on the option name in
> > > > >> > > FLINK-32895[2] in advance.
> > > > >> > >
> > > > >> > > Looking forward to and welcome everyone's feedback and
> > > suggestions,
> > > > >> thank
> > > > >> > > you.
> > > > >> > >
> > > > >> > > [1] https://cwiki.apache.org/confluence/x/uJqzDw
> > > > >> > > [2] https://issues.apache.org/jira/browse/FLINK-32895
> > > > >> > >
> > > > >> > > Best,
> > > > >> > > Rui
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > https://twitter.com/snntrable
> > > > >> > https://github.com/knaufk
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

Reply via email to