Hi Rui,

Thank you for this proposal and for working on this. I agree that exponential backoff makes sense as the new default in general. I do think that restarting indefinitely (no max attempts) makes sense as the default, but of course allowing users to change this is valuable.
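As a rough illustration of that default, here is a minimal Python sketch of an exponential delay schedule with an optional attempt limit. It is only a sketch, not Flink's implementation: the function and parameter names (`restart_delays`, `initial_delay_s`, `multiplier`, `max_delay_s`, `max_attempts`) are hypothetical and are not Flink configuration keys, and it ignores anything else a real strategy would need, such as resetting the backoff after a stable period.

```python
from itertools import islice
from typing import Iterator, Optional


def restart_delays(
    initial_delay_s: float = 1.0,
    multiplier: float = 2.0,
    max_delay_s: float = 60.0,
    max_attempts: Optional[int] = None,  # hypothetical option; None = restart indefinitely
) -> Iterator[float]:
    """Yield the delay (in seconds) to wait before each restart attempt.

    With max_attempts=None the generator never ends, i.e. the job would be
    restarted indefinitely. With an integer it stops after that many
    attempts, at which point the caller would fail the job.
    """
    attempt = 0
    delay = initial_delay_s
    while max_attempts is None or attempt < max_attempts:
        yield delay
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * multiplier, max_delay_s)
        attempt += 1


# Unlimited by default: take only the first three delays of the endless schedule.
print(list(islice(restart_delays(), 3)))  # [1.0, 2.0, 4.0]

# User-configured limit: the schedule ends after five attempts.
print(list(restart_delays(1.0, 2.0, 8.0, max_attempts=5)))  # [1.0, 2.0, 4.0, 8.0, 8.0]
```

Because the delay is capped, an unlimited schedule settles into retrying at a fixed, low rate, which is why an indefinite default seems tolerable with exponential delay.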
So, overall +1.

Cheers,

Konstantin

On Tue, Oct 17, 2023 at 07:11, Rui Fan <1996fan...@gmail.com> wrote:

> Hi all,
>
> I would like to start a discussion on FLIP-364: Improve the
> restart-strategy[1]
>
> As we know, the restart-strategy is critical for flink jobs, it mainly
> has two functions:
> 1. When an exception occurs in the flink job, quickly restart the job
> so that the job can return to the running state.
> 2. When a job cannot be recovered after frequent restarts within
> a certain period of time, Flink will not retry but will fail the job.
>
> The current restart-strategy support for function 2 has some issues:
> 1. The exponential-delay doesn't have the max attempts mechanism,
> it means that flink will restart indefinitely even if it fails frequently.
> 2. For multi-region streaming jobs and all batch jobs, the failure of
> each region will increase the total number of job failures by +1,
> even if these failures occur at the same time. If the number of
> failures increases too quickly, it will be difficult to set a reasonable
> number of retries.
> If the maximum number of failures is set too low, the job can easily
> reach the retry limit, causing the job to fail. If set too high, some jobs
> will never fail.
>
> In addition, when the above two problems are solved, we can also
> discuss whether exponential-delay can replace fixed-delay as the
> default restart-strategy. In theory, exponential-delay is smarter and
> friendlier than fixed-delay.
>
> I also thank Zhu Zhu for his suggestions on the option name in
> FLINK-32895[2] in advance.
>
> Looking forward to and welcome everyone's feedback and suggestions, thank
> you.
>
> [1] https://cwiki.apache.org/confluence/x/uJqzDw
> [2] https://issues.apache.org/jira/browse/FLINK-32895
>
> Best,
> Rui
>

--
https://twitter.com/snntrable
https://github.com/knaufk