Hi Rui,

Thank you for the proposal and for working on this. I agree that
exponential backoff makes sense as the new default in general. I do
think, though, that restarting indefinitely (no max attempts) makes
sense as the default; allowing users to change that is of course valuable.
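
To make that default concrete, here is a minimal sketch of exponential
backoff with an optional attempt limit. It is only an illustration under
my own assumptions: the class name, field names, and default values are
hypothetical and are not Flink's API or its configuration options.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExponentialDelaySketch:
    """Hypothetical model of an exponential-delay restart policy (not Flink code)."""

    initial_backoff_s: float = 1.0      # assumed delay before the first restart
    max_backoff_s: float = 60.0         # assumed upper bound on the delay
    multiplier: float = 2.0             # assumed growth factor per failed attempt
    max_attempts: Optional[int] = None  # None means restart indefinitely

    def next_delay(self, attempt: int) -> Optional[float]:
        """Return the delay before restart `attempt` (1-based), or None to fail the job."""
        if self.max_attempts is not None and attempt > self.max_attempts:
            return None
        delay = self.initial_backoff_s
        # Grow step by step and stop at the cap, so large attempt numbers stay cheap.
        for _ in range(attempt - 1):
            delay = min(delay * self.multiplier, self.max_backoff_s)
            if delay >= self.max_backoff_s:
                break
        return delay


unlimited = ExponentialDelaySketch()               # default: no max attempts
limited = ExponentialDelaySketch(max_attempts=3)   # user opts into a limit
delays = [unlimited.next_delay(a) for a in range(1, 9)]
```

With these assumed values, the unlimited policy keeps retrying at the
capped delay, while the limited one gives up on the fourth attempt.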

So, overall +1.

Cheers,

Konstantin

On Tue, Oct 17, 2023 at 07:11, Rui Fan <1996fan...@gmail.com> wrote:

> Hi all,
>
> I would like to start a discussion on FLIP-364: Improve the
> restart-strategy[1]
>
> As we know, the restart-strategy is critical for Flink jobs. It mainly
> has two functions:
> 1. When an exception occurs in a Flink job, quickly restart the job
> so that it can return to the running state.
> 2. When a job cannot be recovered after frequent restarts within
> a certain period of time, Flink will not retry but will fail the job.
>
> The current restart-strategy support for function 2 has some issues:
> 1. The exponential-delay strategy has no max attempts mechanism,
> which means that Flink will restart indefinitely even if the job fails frequently.
> 2. For multi-region streaming jobs and all batch jobs, the failure of
> each region increases the total number of job failures by 1,
> even if these failures occur at the same time. If the number of
> failures increases too quickly, it will be difficult to set a reasonable
> number of retries.
> If the maximum number of failures is set too low, the job can easily
> reach the retry limit, causing the job to fail. If set too high, some jobs
> will never fail.
>
> In addition, when the above two problems are solved, we can also
> discuss whether exponential-delay can replace fixed-delay as the
> default restart-strategy. In theory, exponential-delay is smarter and
> friendlier than fixed-delay.
>
> I also thank Zhu Zhu for his suggestions on the option name in
> FLINK-32895[2] in advance.
>
> I look forward to and welcome everyone's feedback and suggestions, thank
> you.
>
> [1] https://cwiki.apache.org/confluence/x/uJqzDw
> [2] https://issues.apache.org/jira/browse/FLINK-32895
>
> Best,
> Rui
>


-- 
https://twitter.com/snntrable
https://github.com/knaufk
