[ https://issues.apache.org/jira/browse/FLINK-20752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chesnay Schepler closed FLINK-20752. ------------------------------------ Release Note: The Failure Rate Restart Strategy was allowing 1 less restart per interval as configured. Users wishing to keep the current behavior should reduce the maximum number of allowed failures per interval by 1. (was: The Failure Rate Restart Strategy was allowing 1 less restart per interval as configured. Users wishing to retain the current behavior should reduce the maximum number of allowed failures per interval by 1.) Resolution: Fixed master: 342ec9b9e7e4021b5fb90220d098590d80620005 > FailureRateRestartBackoffTimeStrategy allows one less restart than configured > ----------------------------------------------------------------------------- > > Key: FLINK-20752 > URL: https://issues.apache.org/jira/browse/FLINK-20752 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Affects Versions: 1.10.0 > Reporter: Chesnay Schepler > Assignee: Chesnay Schepler > Priority: Minor > Labels: pull-request-available > Fix For: 1.13.0 > > > The {{FailureRateRestartBackoffTimeStrategy}} maintains a list of failure > timestamps, keeping N timestamps where N is the configured of failures per > interval. > The timestamp is added when #notifyFailure() is called, and later evaluated > within #canRestart(). > To determine whether a restart should be allowed we first check whether we > are already storing N timestamps, and if so check whether the earliest > failure still falls within the current interval. If it does, we reject the > restart. > The problem is that we check whether we have already stored exactly N > timestamps. If we have exactly N timestamps, and we allow N failures per > interval, then we should not reject the restart by definition. We should > instead be checking whether N+1 timestamps have been stored. > For example, let's say we allow 2 exceptions, and 2 have occurred so far. > Regardless of what the timestamps are, we should still allow a restart in > this case. > Only once a third exception occurs should we be looking at the timestamps, > and we should furthermore only look at the exception exceeding the allowed > failure count; in this example it is the very first exception. > I don't know why this went unnoticed for so long; the relevant tests fail > rather reliably for me locally. > ({{FailureRateRestartBackoffTimeStrategyTest}}, > {{SimpleRecoveryFailureRateStrategyITBase}}) > -- This message was sent by Atlassian Jira (v8.3.4#803005)