[ 
https://issues.apache.org/jira/browse/FLINK-32895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763394#comment-17763394
 ] 

Rui Fan commented on FLINK-32895:
---------------------------------

Hi [~zhuzh], I created FLIP-364 [1] in advance because I found several points 
in the restart strategy that need to be improved. We can discuss them on the 
mailing list later.

There are 2 options for discussion:
 * Option 1: Start discussing FLIP-364 after the deprecation of 
RestartStrategies has been discussed.
 * Option 2: FLIP-364 has several points that need to be discussed, so we can 
first discuss the parts of FLIP-364 other than RestartStrategies. The 
RestartStrategies part can then be covered by your separate FLIP. 

WDYT?

BTW, after some more thought: 
restart-strategy.exponential-delay.fail-on-exceeding-max-backoff may not work 
well, because the user may want to restart the job multiple times at 
max-backoff before failing it.

For example, users who don't want the delay time to grow too long might set 
initial-backoff=1s, backoff-multiplier=2, max-backoff=30s. The delay times are 
then 1s, 2s, 4s, 8s, 16s, 30s, 30s, 30s, 30s, 30s, etc. If we introduced 
`fail-on-exceeding-max-backoff`, the job would not restart once the delay time 
reaches 30s for the first time, right?
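
To make the numbers above concrete, here is a minimal sketch of how the delay 
sequence is derived. The function name `backoff_delays` is hypothetical and 
this is not Flink's implementation; it ignores jitter and the backoff reset 
threshold, and only applies the multiplier and the cap.

```python
def backoff_delays(initial, multiplier, max_backoff, attempts):
    """Return the delay in seconds before each of the first `attempts` restarts.

    Illustrative only: jitter and backoff reset are deliberately left out.
    """
    delays = []
    current = initial
    for _ in range(attempts):
        delays.append(min(current, max_backoff))
        # Grow the backoff for the next restart, but never beyond the cap.
        current = min(current * multiplier, max_backoff)
    return delays


delays = backoff_delays(initial=1, multiplier=2, max_backoff=30, attempts=10)
print(delays)  # [1, 2, 4, 8, 16, 30, 30, 30, 30, 30]

# 1-based position of the first restart whose delay is capped at max-backoff.
first_capped = delays.index(30) + 1
print(first_capped)  # 6
```

With these settings the cap is already reached at the sixth restart, so if the 
option fails the job as soon as the delay first reaches max-backoff (my 
reading of it, which may be wrong), the job would never get a retry at the 30s 
delay.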

Please correct me if I'm wrong. I'm looking forward to more feedback from the 
community, thanks~

 

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-364%3A+Improve+the+restart-strategy

> Introduce the max attempts for Exponential Delay Restart Strategy
> -----------------------------------------------------------------
>
>                 Key: FLINK-32895
>                 URL: https://issues.apache.org/jira/browse/FLINK-32895
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, Flink has 3 restart strategies: fixed-delay, 
> failure-rate and exponential-delay.
> The exponential-delay strategy is suitable if a job continues to fail for a 
> period of time. The fixed-delay and failure-rate strategies have a max 
> attempts mechanism, which means the job won't restart and will fail once the 
> number of attempts exceeds the max attempts threshold. 
> The max attempts mechanism is reasonable: Flink should not, and does not need 
> to, restart the job infinitely if it keeps failing. However, the 
> exponential-delay strategy doesn't have a max attempts mechanism.
> I propose introducing 
> `restart-strategy.exponential-delay.max-attempts-before-reset` to support the 
> max attempts mechanism for exponential-delay. It means Flink won't restart the 
> job if the number of job failures before reset exceeds 
> max-attempts-before-reset when exponential-delay is enabled.



