Hi Mingliang,

Thank you for the feedback here!
Glad to hear Netflix has made exponential-delay the default restart
strategy. Our production environment (Shopee) has also used
exponential-delay as the default since May 2021, and the number of Flink
jobs now far exceeds tens of thousands. These jobs work well.

Note: Our internal exponential-delay implementation solves the problem
where a large number of tasks failing in a short period of time causes
restartAttempts to increase rapidly.

Based on your production experience, do you have any suggestions for the
default values of the exponential-delay configuration? Zhu and Jing may
also be interested in this question.

The default values proposed in FLIP-364 are:

restart-strategy.exponential-delay.max-attempts-before-reset-backoff : Integer.MAX_VALUE
restart-strategy.exponential-delay.initial-backoff : 1s
restart-strategy.exponential-delay.backoff-multiplier : 1.2
restart-strategy.exponential-delay.jitter-factor : 0.1
restart-strategy.exponential-delay.max-backoff : 1 min
restart-strategy.exponential-delay.reset-backoff-threshold : 1h

Looking forward to your feedback! I will also start a discussion on the
user mailing list to collect more feedback.

In addition, I understand that the community needs to weigh compatibility
and risk carefully when changing a default value. If it is too difficult
to reach consensus on this, I can remove this item from the FLIP.

Best,
Rui

On Wed, Nov 15, 2023 at 6:40 AM Mingliang Liu <lium...@apache.org> wrote:

> Thanks Rui for driving this. I just call out that making exponential-delay
> the default is a good change. At Netflix, we have enabled this as the
> default restart strategy 2 quarters ago and it has been working well.
> Keeping it restarting indefinitely by default makes sense to me.
>
> On Mon, Oct 16, 2023 at 10:11 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi all,
> >
> > I would like to start a discussion on FLIP-364: Improve the
> > restart-strategy[1]
> >
> > As we know, the restart-strategy is critical for flink jobs, it mainly
> > has two functions:
> > 1. When an exception occurs in the flink job, quickly restart the job
> > so that the job can return to the running state.
> > 2. When a job cannot be recovered after frequent restarts within
> > a certain period of time, Flink will not retry but will fail the job.
> >
> > The current restart-strategy support for function 2 has some issues:
> > 1. The exponential-delay doesn't have the max attempts mechanism,
> > it means that flink will restart indefinitely even if it fails
> > frequently.
> > 2. For multi-region streaming jobs and all batch jobs, the failure of
> > each region will increase the total number of job failures by +1,
> > even if these failures occur at the same time. If the number of
> > failures increases too quickly, it will be difficult to set a reasonable
> > number of retries.
> > If the maximum number of failures is set too low, the job can easily
> > reach the retry limit, causing the job to fail. If set too high, some
> > jobs will never fail.
> >
> > In addition, when the above two problems are solved, we can also
> > discuss whether exponential-delay can replace fixed-delay as the
> > default restart-strategy. In theory, exponential-delay is smarter and
> > friendlier than fixed-delay.
> >
> > I also thank Zhu Zhu for his suggestions on the option name in
> > FLINK-32895[2] in advance.
> >
> > Looking forward to and welcome everyone's feedback and suggestions, thank
> > you.
> >
> > [1] https://cwiki.apache.org/confluence/x/uJqzDw
> > [2] https://issues.apache.org/jira/browse/FLINK-32895
> >
> > Best,
> > Rui
> >
>
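
For reference, here is a rough sketch of what the FLIP-364 default values proposed at the top of this message would imply for the restart delay. It is only an illustration in Python, not Flink's implementation: the helper names `backoff_delay` and `with_jitter` are hypothetical, and it assumes the delay is multiplied by backoff-multiplier after each consecutive failure, capped at max-backoff, with jitter applied as a symmetric random fraction of the delay. The reset after reset-backoff-threshold is not modelled.

```python
import random

# Proposed FLIP-364 defaults from this thread, expressed in seconds.
INITIAL_BACKOFF = 1.0
BACKOFF_MULTIPLIER = 1.2
MAX_BACKOFF = 60.0
JITTER_FACTOR = 0.1


def backoff_delay(increases, initial=INITIAL_BACKOFF,
                  multiplier=BACKOFF_MULTIPLIER, max_backoff=MAX_BACKOFF):
    """Un-jittered delay after the backoff has been increased `increases` times.

    Hypothetical helper: assumes plain geometric growth capped at max_backoff.
    """
    return min(initial * multiplier ** increases, max_backoff)


def with_jitter(delay, jitter_factor=JITTER_FACTOR, rng=random):
    """Spread the delay by up to +/- jitter_factor (assumed symmetric jitter)."""
    return delay * (1.0 + rng.uniform(-jitter_factor, jitter_factor))


if __name__ == "__main__":
    # Print the un-jittered delay at a few points along the sequence.
    for n in (0, 1, 5, 10, 22, 23):
        print(n, round(backoff_delay(n), 2))
```

Under these assumptions the un-jittered delay first hits the 1 min cap after 23 increases, since 1.2^22 is about 55.2 and 1.2^23 is about 66.2. With a 1s initial backoff, a job that keeps failing would therefore reach the maximum delay after a few minutes of cumulative waiting.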