[ https://issues.apache.org/jira/browse/FLINK-32895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763394#comment-17763394 ]
Rui Fan commented on FLINK-32895:
---------------------------------

Hi [~zhuzh],

I created FLIP-364 [1] in advance because I found several points in the restart strategy that need to be improved. We can discuss them on the mailing list in the future.

There are 2 options for discussion:
* Option 1: Start discussing FLIP-364 after the deprecation of RestartStrategies has been discussed.
* Option 2: FLIP-364 has several points that need to be discussed, so we can first discuss the parts of FLIP-364 other than RestartStrategies. The RestartStrategies part can then follow in your separate FLIP.

WDYT?

BTW, after some more thought: restart-strategy.exponential-delay.fail-on-exceeding-max-backoff may not work well, because the user may want to restart the job multiple times using max-backoff before failing it.

For example, users don't want the delay time to be too long, so they set initial-backoff=1s, backoff-multiplier=2, max-backoff=30s. The delay time is then 1s, 2s, 4s, 8s, 16s, 30s, 30s, 30s, 30s, 30s, etc.

If we introduced `fail-on-exceeding-max-backoff`, the job would not restart once the delay time reaches 30s for the first time, right?

Please correct me if I'm wrong. Looking forward to more feedback from the community, thanks~

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-364%3A+Improve+the+restart-strategy

> Introduce the max attempts for Exponential Delay Restart Strategy
> -----------------------------------------------------------------
>
>                 Key: FLINK-32895
>                 URL: https://issues.apache.org/jira/browse/FLINK-32895
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, Flink has 3 restart strategies: fixed-delay, failure-rate and exponential-delay.
> The exponential-delay strategy is suitable if a job continues to fail for a period of time.
> The fixed-delay and failure-rate strategies have a max attempts mechanism: the job won't restart and goes to failed once the number of attempts exceeds the max attempts threshold.
> The max attempts mechanism is reasonable: Flink should not, and does not need to, restart the job indefinitely if the job keeps failing. However, exponential-delay doesn't have the max attempts mechanism.
> I propose introducing `restart-strategy.exponential-delay.max-attempts-before-reset` to support the max attempts mechanism for exponential-delay. It means Flink won't restart the job if the number of job failures before reset exceeds max-attempts-before-reset when exponential-delay is enabled.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
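
For reference, the backoff sequence from the comment (initial-backoff=1s, backoff-multiplier=2, max-backoff=30s) can be reproduced with a small sketch. This is not Flink code: the function names `backoff_delays` and `first_capped_restart` are hypothetical, and the sketch deliberately ignores jitter and the reset threshold. It only illustrates the arithmetic and shows at which restart the delay first hits the cap, which is the point where, under the reading of `fail-on-exceeding-max-backoff` given in the comment, the job would stop restarting.

```python
def backoff_delays(initial_s, multiplier, max_s, restarts):
    """Return the delay in seconds before each of the first `restarts` restarts.

    Simplified model (assumption): no jitter, no backoff reset.
    """
    delays = []
    current = initial_s
    for _ in range(restarts):
        delays.append(min(current, max_s))
        # Grow the delay for the next restart, capped at max_s.
        current = min(current * multiplier, max_s)
    return delays


def first_capped_restart(delays, max_s):
    """Return the 1-based index of the first restart whose delay reaches max_s, or None."""
    for index, delay in enumerate(delays, start=1):
        if delay >= max_s:
            return index
    return None


delays = backoff_delays(initial_s=1, multiplier=2, max_s=30, restarts=10)
print(delays)                            # delays for the first 10 restarts
print(first_capped_restart(delays, 30))  # restart at which the cap is first hit
```

With these settings the cap is reached at the 6th restart, so a user who wants several restarts at the 30s delay before giving up would need a separate attempt limit such as the proposed `max-attempts-before-reset`.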