Hi Mingliang:

Thank you for the feedback!

Glad to hear that Netflix has made exponential-delay the
default restart strategy. Our production environment (Shopee)
has also used exponential-delay as the default since May 2021,
and we currently run far more than tens of thousands of Flink
jobs. These jobs work well.

Note: Our internal exponential-delay implementation solves the
problem where a large number of tasks failing within a short
period of time causes restartAttempts to increase rapidly.

Based on your production experience, do you have any suggestions
for the default values of the exponential-delay configuration?

Zhu and Jing may also be interested in this question.

The following are the default values proposed in FLIP-364:

restart-strategy.exponential-delay.max-attempts-before-reset-backoff : Integer.MAX_VALUE
restart-strategy.exponential-delay.initial-backoff : 1s
restart-strategy.exponential-delay.backoff-multiplier : 1.2
restart-strategy.exponential-delay.jitter-factor : 0.1
restart-strategy.exponential-delay.max-backoff : 1 min
restart-strategy.exponential-delay.reset-backoff-threshold : 1h
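
To make these values easier to reason about, here is a rough sketch of how
the delay would grow under them. It is a simplified model and not Flink's
actual implementation: the function names are hypothetical, the jitter is
assumed to be a uniform +/- spread around the base delay, and the reset
threshold and max attempts are not modelled.

```python
import random

# Defaults proposed in this thread (values copied from the list above).
INITIAL_BACKOFF_S = 1.0
BACKOFF_MULTIPLIER = 1.2
MAX_BACKOFF_S = 60.0
JITTER_FACTOR = 0.1


def base_backoff(attempt: int) -> float:
    """Delay in seconds before restart `attempt` (0-based), without jitter."""
    return min(INITIAL_BACKOFF_S * BACKOFF_MULTIPLIER ** attempt, MAX_BACKOFF_S)


def jittered_backoff(attempt: int, rng: random.Random) -> float:
    """Base delay with an assumed uniform +/- JITTER_FACTOR spread.

    This simplified model does not clamp the result after adding jitter.
    """
    base = base_backoff(attempt)
    return base * (1.0 + rng.uniform(-JITTER_FACTOR, JITTER_FACTOR))


def attempts_until_cap() -> int:
    """First 0-based attempt index whose base delay reaches MAX_BACKOFF_S."""
    attempt = 0
    while base_backoff(attempt) < MAX_BACKOFF_S:
        attempt += 1
    return attempt


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (0, 1, 5, 10, 20, attempts_until_cap()):
        print(n, round(base_backoff(n), 2), round(jittered_backoff(n, rng), 2))
```

Under this model, with a multiplier of 1.2 the base delay only reaches the
1 min cap at attempt index 23, i.e. the 24th consecutive restart.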

Looking forward to your feedback! I will also start a discussion
on the user mailing list to collect more feedback.

In addition, I understand that the community needs to weigh
compatibility and risk carefully when changing a default value.
If it proves very difficult to reach consensus on this, I can
remove this item from the FLIP.

Best,
Rui

On Wed, Nov 15, 2023 at 6:40 AM Mingliang Liu <lium...@apache.org> wrote:

> Thanks Rui for driving this. I just call out that making exponential-delay
> the default is a good change. At Netflix, we have enabled this as the
> default restart strategy 2 quarters ago and it has been working well.
> Keeping it restarting indefinitely by default makes sense to me.
>
> On Mon, Oct 16, 2023 at 10:11 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi all,
> >
> > I would like to start a discussion on FLIP-364: Improve the
> > restart-strategy[1]
> >
> > As we know, the restart-strategy is critical for flink jobs, it mainly
> > has two functions:
> > 1. When an exception occurs in the flink job, quickly restart the job
> > so that the job can return to the running state.
> > 2. When a job cannot be recovered after frequent restarts within
> > a certain period of time, Flink will not retry but will fail the job.
> >
> > The current restart-strategy support for function 2 has some issues:
> > 1. The exponential-delay doesn't have the max attempts mechanism,
> > it means that flink will restart indefinitely even if it fails
> > frequently.
> > 2. For multi-region streaming jobs and all batch jobs, the failure of
> > each region will increase the total number of job failures by +1,
> > even if these failures occur at the same time. If the number of
> > failures increases too quickly, it will be difficult to set a reasonable
> > number of retries.
> > If the maximum number of failures is set too low, the job can easily
> > reach the retry limit, causing the job to fail. If set too high, some
> > jobs will never fail.
> >
> > In addition, when the above two problems are solved, we can also
> > discuss whether exponential-delay can replace fixed-delay as the
> > default restart-strategy. In theory, exponential-delay is smarter and
> > friendlier than fixed-delay.
> >
> > I also thank Zhu Zhu for his suggestions on the option name in
> > FLINK-32895[2] in advance.
> >
> > Looking forward to and welcome everyone's feedback and suggestions, thank
> > you.
> >
> > [1] https://cwiki.apache.org/confluence/x/uJqzDw
> > [2] https://issues.apache.org/jira/browse/FLINK-32895
> >
> > Best,
> > Rui
> >
>
