Re: [DISCUSS] FLIP-364: Improve the restart-strategy

Rui Fan Tue, 14 Nov 2023 04:31:55 -0800

Thanks a lot Zhu and Jing for the comments!

Regarding concurrent failures mentioned by zhu, I am not familiar with it
before
and need some time to get familiar with it. So I will reply to them later.


I will give Jing an answer first:

> NIT: @Rui it would be great if you could point out the source code
> reference where the restartAttempt will be calculated rapidly,
> i.e. +100 with your one task example. All related docs I could find didn't
> describe it clearly.

I added a new part to explain why the restartAttempt is increased rapidly
in FLIP[1].
And give an example to reproduce it.

If all tasks of one job throw an exception directly and have the following
options,
we expect that this job can be retried 5 times with delay 10s each time,
and the
Flink job will fail after 50s:
- restart-strategy : fixed-delay
- restart-strategy.fixed-delay.attempts : 5
- restart-strategy.fixed-delay.delay : 10 s

Here is result for 3 cases:
- case1 is default(streaming job with region failover): Job execution
failed immediately (It's unexpected)
- case2 is streaming job with full failover: Job execution failed after 50s
(5 * 10s) (it's expected)
- case3 is batch job with region failover: Job execution failed immediately
(It's unexpected)

And case1 and case3 are unexpected.
Note: The demo code can be found in FLIP, it's very simple.

Why is this happening? Let me briefly summarize the reasons:

When one task fails, it will call RestartBackoffTimeStrategy#notifyFailure,
and FixedDelayRestartBackoffTimeStrategy will call currentRestartAttempt++.

All tasks are one region:
When the first task fails, JobMaster will mark the rest of the tasks as
ExecutionState#CANCELING.

When rest tasks report exceptions, they will switch state from CANCELING to
CANCELED instead of FAILED. So rest tasks won't call failed logic
(Don't increase the currentRestartAttempt).

Why doesn't region-failover work well?

If one job has 100 regions, and every region just has one task. If one task
fails,
JobMaster doesn't update other tasks. If rest tasks report exceptions, they
will
increase the currentRestartAttempt.

Detailed call stacks and code paths can be found in FLIP.

Please let me know if my understanding is wrong or if it's not clear,
thanks~

[1] https://cwiki.apache.org/confluence/x/uJqzDw

Best,
Rui

On Tue, Nov 14, 2023 at 5:10 AM Jing Ge <j...@ververica.com.invalid> wrote:

> Hi Rui,
>
> Thanks for the proposal! I agree with Zhu that any changes of the default
> behaviors will have impact on users' jobs in the production environment and
> it would be necessary to have users' attention to to avoid
> any surprises after upgrading Flink.
>
> @Zhu
> for 1, if we change the default values, we will not only change the
> behavior of the default restart-strategy but also change the default
> behavior of exponential-delay strategy for users who are explicitly
> configured to use the strategy with default values in the production.
>
> NIT: @Rui it would be great if you could point out the source code
> reference where the restartAttempt will be calculated rapidly,
> i.e. +100 with your one task example. All related docs I could find didn't
> describe it clearly.
>
> Best regards,
> Jing
>
>
> On Mon, Nov 13, 2023 at 11:55 AM Zhu Zhu <reed...@gmail.com> wrote:
>
> > Hi Rui,
> > Thanks for creating this FLIP and sorry for jumping in so late into the
> > discussion.
> >
> > The improvements of exponential-delay strategy and making it the default
> > strategy looks good it me in general. I have some comments for it, as
> well
> > as for the failure counting.
> >
> > 1. default values of exponential-delay configuration
> > It is mentioned in the FLIP that "the default value of these options are
> > not changed, we just change them for default restart-strategy. The
> default
> > restart-strategy just takes effect if checkpointing is enabled and the
> > user doesn’t configure the restart-strategy". I'm a bit concerned with
> > the inconsistency which may cause confusion to users. So if the proposed
> > configuration values work better in most cases, I'm leaning towards to
> > change the default values.
> >
> > 2. the default tolerable failure count
> > Currently, the restart-strategy behaves like this by default(if not
> > configured):
> > 1. job retries on failures indefinitely if checkpointing is enabled
> > 2. job fails on any failure if checkpointing is disabled
> > Changing it to always restart indefinitely may result in unexpected
> > behaviors in production. Therefore, either we should do the same
> > thing to exponential-delay, or have an open discussion(also involving
> > the user ML) on changing this default behavior.
> >
> > 3. failure counting
> > Flink currently will try to recognize concurrent failures and group them
> > together, which can be seen in the web UI. So how about to align the
> > failure counting with the concurrent failures computing? This can make it
> > more consistent and easier for understanding. It will require changes to
> > the concurrent failures computing though, i.e. taking the backoff time
> > into consideration. So maybe we can open a seperate FLIP for this change.
> >
> > Thanks,
> > Zhu
> >
> > Rui Fan <1996fan...@gmail.com> 于2023年11月10日周五 18:22写道：
> >
> > > I'll start voting next Monday if there isn't any other comment.
> > >
> > > Best,
> > > Rui
> > >
> > > On Thu, Oct 19, 2023 at 6:59 PM Rui Fan <1996fan...@gmail.com> wrote:
> > >
> > > > Hi Konstantin and Max,
> > > >
> > > > Thanks for your feedback!
> > > >
> > > > Sorry, I forgot to mention the default value of
> > > >
> `restart-strategy.exponential-delay.max-attempts-before-reset-backoff`.
> > > >
> > > > Retrying forever sounds good to me, I have added it to the FLIP:
> > > >
> > > > The default value of
> > > >
> `restart-strategy.exponential-delay.max-attempts-before-reset-backoff`
> > is
> > > > Integer.MAX_VALUE.
> > > >
> > > > Best,
> > > > Rui
> > > >
> > > > On Thu, Oct 19, 2023 at 6:29 PM Maximilian Michels <m...@apache.org>
> > > wrote:
> > > >
> > > >> Hey Rui,
> > > >>
> > > >> +1 for making exponential backoff the default. I agree with
> Konstantin
> > > >> that retrying forever is a good default for exponential backoff
> > > >> because oftentimes the issue will resolve eventually. The purpose of
> > > >> exponential backoff is precisely to continue to retry without
> causing
> > > >> too much load. However, I'm not against adding an optional max
> number
> > > >> of retries.
> > > >>
> > > >> -Max
> > > >>
> > > >> On Thu, Oct 19, 2023 at 11:35 AM Konstantin Knauf <
> kna...@apache.org>
> > > >> wrote:
> > > >> >
> > > >> > Hi Rui,
> > > >> >
> > > >> > Thank you for this proposal and working on this. I also agree that
> > > >> > exponential back off makes sense as a new default in general. I
> > think
> > > >> > restarting indefinitely (no max attempts) makes sense by default,
> > > >> though,
> > > >> > but of course allowing users to change is valuable.
> > > >> >
> > > >> > So, overall +1.
> > > >> >
> > > >> > Cheers,
> > > >> >
> > > >> > Konstantin
> > > >> >
> > > >> > Am Di., 17. Okt. 2023 um 07:11 Uhr schrieb Rui Fan <
> > > >> 1996fan...@gmail.com>:
> > > >> >
> > > >> > > Hi all,
> > > >> > >
> > > >> > > I would like to start a discussion on FLIP-364: Improve the
> > > >> > > restart-strategy[1]
> > > >> > >
> > > >> > > As we know, the restart-strategy is critical for flink jobs, it
> > > mainly
> > > >> > > has two functions:
> > > >> > > 1. When an exception occurs in the flink job, quickly restart
> the
> > > job
> > > >> > > so that the job can return to the running state.
> > > >> > > 2. When a job cannot be recovered after frequent restarts within
> > > >> > > a certain period of time, Flink will not retry but will fail the
> > > job.
> > > >> > >
> > > >> > > The current restart-strategy support for function 2 has some
> > issues:
> > > >> > > 1. The exponential-delay doesn't have the max attempts
> mechanism,
> > > >> > > it means that flink will restart indefinitely even if it fails
> > > >> frequently.
> > > >> > > 2. For multi-region streaming jobs and all batch jobs, the
> failure
> > > of
> > > >> > > each region will increase the total number of job failures by
> +1,
> > > >> > > even if these failures occur at the same time. If the number of
> > > >> > > failures increases too quickly, it will be difficult to set a
> > > >> reasonable
> > > >> > > number of retries.
> > > >> > > If the maximum number of failures is set too low, the job can
> > easily
> > > >> > > reach the retry limit, causing the job to fail. If set too high,
> > > some
> > > >> jobs
> > > >> > > will never fail.
> > > >> > >
> > > >> > > In addition, when the above two problems are solved, we can also
> > > >> > > discuss whether exponential-delay can replace fixed-delay as the
> > > >> > > default restart-strategy. In theory, exponential-delay is
> smarter
> > > and
> > > >> > > friendlier than fixed-delay.
> > > >> > >
> > > >> > > I also thank Zhu Zhu for his suggestions on the option name in
> > > >> > > FLINK-32895[2] in advance.
> > > >> > >
> > > >> > > Looking forward to and welcome everyone's feedback and
> > suggestions,
> > > >> thank
> > > >> > > you.
> > > >> > >
> > > >> > > [1] https://cwiki.apache.org/confluence/x/uJqzDw
> > > >> > > [2] https://issues.apache.org/jira/browse/FLINK-32895
> > > >> > >
> > > >> > > Best,
> > > >> > > Rui
> > > >> > >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > https://twitter.com/snntrable
> > > >> > https://github.com/knaufk
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-364: Improve the restart-strategy

Reply via email to