[jira] [Commented] (FLINK-10074) Allowable number of checkpoint failures

vinoyang (JIRA) Thu, 09 Aug 2018 08:51:06 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575038#comment-16575038
 ]


vinoyang commented on FLINK-10074:
----------------------------------

[~till.rohrmann] Yes, I agree with you. If we focus on time, it will become 
more complicated for users, because there are multiple time-related 
configurations whose details they would need to understand. If we focus on 
counts instead, it will be more user-friendly, for example a maximum number 
of timeouts and failures.

> Allowable number of checkpoint failures 
> ----------------------------------------
>
>                 Key: FLINK-10074
>                 URL: https://issues.apache.org/jira/browse/FLINK-10074
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>            Reporter: Thomas Weise
>            Assignee: vinoyang
>            Priority: Major
>
> For intermittent checkpoint failures, it is desirable to have a mechanism to 
> avoid restarts. If, for example, a transient S3 error prevents checkpoint 
> completion, the next checkpoint may very well succeed. The user may wish to 
> avoid the expense of a restart in such a scenario, and this could be 
> expressed with a failure threshold (number of consecutive checkpoint 
> failures), possibly combined with a list of exceptions to tolerate.
>  
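
To make the count-based proposal concrete, here is a minimal sketch of a 
failure threshold combined with a list of tolerated exceptions. It is only an 
illustration under stated assumptions, not Flink's implementation: the class 
CheckpointFailureTracker, its method names, and the rule that a non-listed 
exception fails the job immediately are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, NOT part of Flink's API. It counts consecutive
// checkpoint failures and reports when the job should be failed.
final class CheckpointFailureTracker {
    private final int tolerableFailures; // assumed setting: max consecutive failures tolerated
    private final Set<Class<? extends Throwable>> toleratedTypes; // assumed allow-list
    private int consecutiveFailures;

    CheckpointFailureTracker(int tolerableFailures,
                             Set<? extends Class<? extends Throwable>> toleratedTypes) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
        this.toleratedTypes = new HashSet<>(toleratedTypes);
    }

    // A completed checkpoint ends the current streak of failures.
    void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    // Returns true if the job should be failed (and therefore restarted).
    boolean onCheckpointFailure(Throwable cause) {
        if (!isTolerated(cause)) {
            return true; // assumption: exceptions outside the list are never tolerated
        }
        consecutiveFailures++;
        return consecutiveFailures > tolerableFailures;
    }

    private boolean isTolerated(Throwable cause) {
        for (Class<? extends Throwable> type : toleratedTypes) {
            if (type.isInstance(cause)) {
                return true;
            }
        }
        return false;
    }
}
```

With a threshold of 2 and IOException on the tolerated list, two transient 
S3-style I/O failures in a row would be absorbed, a successful checkpoint 
would reset the count, and only a third consecutive failure (or an exception 
outside the list) would trigger a restart. A single counter like this is the 
kind of setting a user can reason about without knowing the checkpoint 
interval or timeout values.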



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
