[ 
https://issues.apache.org/jira/browse/FLINK-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944732#comment-14944732
 ] 

ASF GitHub Bot commented on FLINK-2066:
---------------------------------------

Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1223#discussion_r41239329
  
    --- Diff: docs/apis/programming_guide.md ---
    @@ -1992,6 +1992,8 @@ With the closure cleaner disabled, it might happen 
that an anonymous user functi
     
     - `getNumberOfExecutionRetries()` / `setNumberOfExecutionRetries(int 
numberOfExecutionRetries)` Sets the number of times that failed tasks are 
re-executed. A value of zero effectively disables fault tolerance. A value of 
`-1` indicates that the system default value (as defined in the configuration) 
should be used.
     
    +- `getExecutionRetryDelay()` / `setExecutionRetryDelay(long 
executionRetryDelay)` Sets the delay that failed tasks are re-executed. A value 
of `-1` indicates that the default value should be used.
    --- End diff --
    
    I think this is a critical parameter, so I would like to extend the 
description a bit. How about this:
    
    ```
    Sets the delay that the system waits after a job has failed, before 
re-executing it. The delay starts after all tasks have been successfully 
stopped on the TaskManagers, and once the delay has passed, the tasks are 
re-started. This parameter is useful to delay re-execution in order to let 
certain time-out related failures surface fully (like broken connections that 
have not fully timed out), before attempting a re-execution and immediately 
failing again due to the same problem.  
    
    This parameter only has an effect if the number of execution retries is 
one or more.
    ```
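
To make the proposed semantics concrete, here is a minimal, self-contained Java sketch. It is not Flink's implementation and does not use `ExecutionConfig`: the names `RetryDelaySketch`, `Job`, `resolveDelay` and `runWithRetries` are hypothetical and exist only for illustration. It assumes that `-1` resolves to a system default and that the delay is applied only before a retry, so it has no effect when the number of retries is zero.

```java
// Hypothetical sketch of "retry with a delay"; not Flink's scheduler or ExecutionConfig.
final class RetryDelaySketch {
    /** Sentinel meaning "use the system-wide default" (assumed to be -1, as in the docs text). */
    static final long USE_DEFAULT = -1L;

    /** A unit of work that may fail; stands in for one execution of a job. */
    interface Job {
        void run();
    }

    /** Resolves the -1 sentinel to the system default; any other value is used as given. */
    static long resolveDelay(long configuredDelayMs, long systemDefaultMs) {
        return configuredDelayMs == USE_DEFAULT ? systemDefaultMs : configuredDelayMs;
    }

    /**
     * Runs the job, retrying up to maxRetries times and waiting delayMs before each retry.
     * Returns the number of attempts made; rethrows the last failure once retries are exhausted.
     */
    static int runWithRetries(Job job, int maxRetries, long delayMs) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                job.run();
                return attempts;
            } catch (RuntimeException failure) {
                if (attempts > maxRetries) {
                    throw failure; // with zero retries the delay is never applied
                }
                sleep(delayMs); // give time-out related failures a chance to surface
            }
        }
    }

    /** Sleeps for the given time, converting an interrupt into an unchecked exception. */
    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails twice, then succeeds, imitating a connection that needs time to recover.
        Job flaky = () -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("connection not yet timed out");
            }
        };
        long delayMs = resolveDelay(USE_DEFAULT, 50L);
        int attempts = runWithRetries(flaky, 3, delayMs);
        System.out.println("succeeded after " + attempts + " attempts, delay " + delayMs + " ms");
    }
}
```

In this sketch, calling `runWithRetries` with `maxRetries` set to `0` fails on the first error without ever sleeping, which mirrors the note that the delay only matters when at least one retry is configured.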


> Make delay between execution retries configurable
> -------------------------------------------------
>
>                 Key: FLINK-2066
>                 URL: https://issues.apache.org/jira/browse/FLINK-2066
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.9, 0.10
>            Reporter: Stephan Ewen
>            Assignee: Nuno Miguel Marques dos Santos
>            Priority: Blocker
>              Labels: starter
>             Fix For: 0.10
>
>
> Flink allows specifying a delay between execution retries. This helps to let 
> some external failure causes fully manifest themselves before the restart is 
> attempted.
> The delay is currently defined only system wide.
> We should add it to the {{ExecutionConfig}} of a job to allow per-job 
> specification.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
