+1 to simplify the RestartStrategy configuration One thing to confirm is whether the default delay should be "0 s" in the case of "If the config option `restart-strategy` is not configured" and "If checkpointing is enabled". I see a related discussion([SURVEY] Is the default restart delay of 0s causing problems) is ongoing and we may need to take the result from that.
Thanks, Zhu Zhu Becket Qin <becket....@gmail.com> 于2019年9月2日周一 上午9:06写道: > +1. The new behavior makes sense to me. > > BTW, we need a FLIP for this :) > > On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <trohrm...@apache.org> > wrote: > > > After an offline discussion with Stephan, we concluded that changing the > > default restart strategy for batch jobs is not that easy because the > > cluster level restart configuration does not necessarily know about the > > type of job which is submitted. We concluded that we would like to keep > the > > batch behaviour as is (NoRestartStrategy) and revisit this issue at a > later > > point in time. > > > > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <trohrm...@apache.org> > > wrote: > > > > > The current default behaviour for batch is `NoRestartStrategy` if > nothing > > > is configured. We could say that we set the default value of > > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 > > s")` > > > independent of the checkpointing. The only downside I could see is that > > > some faulty batch jobs might get stuck in a restart loop without > > reaching a > > > terminal state. > > > > > > @Dawid, I don't intend to touch the ExecutionConfig. This change only > > > targets the cluster level configuration of the RestartStrategy. > > > > > > Cheers, > > > Till > > > > > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz < > dwysakow...@apache.org > > > > > > wrote: > > > > > >> Also +1 in general. > > >> > > >> I have a few questions though: > > >> > > >> - does it only apply to the logic in > > >> > > >> > > > org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory, > > >> which is only the cluster side configuration? Or do you want to change > > >> the logic also on the job side in ExecutionConfig? > > >> > > >> - if the latter, does that mean deprecated methods in ExecutionConfig > > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no > > >> effect? I think this would be a good idea, but would suggest to remove > > >> the corresponding fields and methods. This is not that simple though. > I > > >> tried to do that for other parameters that have no effect already like > > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems: > > >> > > >> 1) setNumberOfExecutionRetires are effectively marked with @Public > > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't > have > > >> this problem). Therefore this would be a binary incompatible change. > > >> > > >> 2) ExecutionConfig is stored in state as part of PojoSerializer in > > >> pre flink 1.7. It should not be a problem for > numberOfExecutionRetries & > > >> executionRetryDelays as they are of primitive types. It is a problem > for > > >> codeAnalysisMode (we cannot remove the class, as this breaks > > >> serialization). I wanted to mention that anyway, just to be aware of > > that. > > >> > > >> Best, > > >> > > >> Dawid > > >> > > >> On 30/08/2019 14:48, Stephan Ewen wrote: > > >> > +1 in general > > >> > > > >> > What is the default in batch, though? No restarts? I always found > that > > >> > somewhat uncommon. > > >> > Should we also change that part, if we are changing the default > > anyways? > > >> > > > >> > > > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <trohrm...@apache.org > > > > >> wrote: > > >> > > > >> >> Hi everyone, > > >> >> > > >> >> I wanted to discuss how to simplify Flink's cluster level > > >> RestartStrategy > > >> >> configuration [1]. Currently, Flink's behaviour with respect to > > >> configuring > > >> >> the {{RestartStrategies}} is quite complicated and convoluted. The > > >> reason > > >> >> for this is that we evolved the way it has been configured and > wanted > > >> to > > >> >> keep it backwards compatible. Due to this, we have currently the > > >> following > > >> >> behaviour: > > >> >> > > >> >> * If the config option `restart-strategy` is configured, then Flink > > >> uses > > >> >> this `RestartStrategy` (so far so simple) > > >> >> * If the config option `restart-strategy` is not configured, then > > >> >> ** If `restart-strategy.fixed-delay.attempts` or > > >> >> `restart-strategy.fixed-delay.delay` are defined, then instantiate > > >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts, > > >> >> restart-strategy.fixed-delay.delay)` > > >> >> ** If `restart-strategy.fixed-delay.attempts` and > > >> >> `restart-strategy.fixed-delay.delay` are not defined, then > > >> >> *** If checkpointing is disabled, then choose `NoRestartStrategy` > > >> >> *** If checkpointing is enabled, then choose > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > > >> >> > > >> >> I would like to simplify the configuration by removing the "If > > >> >> `restart-strategy.fixed-delay.attempts` or > > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way, > the > > >> logic > > >> >> would be the following: > > >> >> > > >> >> * If the config option `restart-strategy` is configured, then Flink > > >> uses > > >> >> this `RestartStrategy` > > >> >> * If the config option `restart-strategy` is not configured, then > > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy` > > >> >> ** If checkpointing is enabled, then choose > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` > > >> >> > > >> >> That way we retain the user friendliness that jobs restart if the > > user > > >> >> enabled checkpointing and we make it clear that any ` > > >> >> restart-strategy.fixed-delay.xyz` setting will only be respected > if > > >> >> `restart-strategy` has been set to `fixed-delay`. > > >> >> > > >> >> This simplification would, however, change Flink's behaviour and > > might > > >> >> break existing setups. Since we introduced `RestartStrategies` with > > >> Flink > > >> >> 1.0.0 and deprecated the prior configuration mechanism which > enables > > >> >> restarting if either the `attempts` or the `delay` has been set, I > > >> think > > >> >> that the number of broken jobs should be minimal if not > non-existent. > > >> >> > > >> >> I'm sure that one can simplify the way RestartStrategies are > > >> >> programmatically configured as well but for the sake of > > >> simplicity/scoping > > >> >> I'd like to not touch it right away. > > >> >> > > >> >> What do you think about this behaviour change? > > >> >> > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921 > > >> >> > > >> >> Cheers, > > >> >> Till > > >> >> > > >> > > >> > > >