Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Till Rohrmann Fri, 30 Aug 2019 06:24:28 -0700

The current default behaviour for batch is `NoRestartStrategy` if nothing
is configured. We could say that we set the default value of
`restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
independent of the checkpointing. The only downside I could see is that
some faulty batch jobs might get stuck in a restart loop without reaching a
terminal state.


@Dawid, I don't intend to touch the ExecutionConfig. This change only
targets the cluster level configuration of the RestartStrategy.

Cheers,
Till

On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <dwysakow...@apache.org>
wrote:

> Also +1 in general.
>
> I have a few questions though:
>
> - does it only apply to the logic in
>
> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory,
> which is only the cluster side configuration? Or do you want to change
> the logic also on the job side in ExecutionConfig?
>
> - if the latter, does that mean deprecated methods in ExecutionConfig
> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no
> effect? I think this would be a good idea, but would suggest to remove
> the corresponding fields and methods. This is not that simple though. I
> tried to do that for other parameters that have no effect already like
> codeAnalysisMode & failTaskOnCheckpointError. The are two problems:
>
>     1) setNumberOfExecutionRetires are effectively marked with @Public
> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't have
> this problem). Therefore this would be a binary incompatible change.
>
>     2) ExecutionConfig is stored in state as part of PojoSerializer in
> pre flink 1.7. It should not be a problem for numberOfExecutionRetries &
> executionRetryDelays as they are of primitive types. It is a problem for
> codeAnalysisMode (we cannot remove the class, as this breaks
> serialization). I wanted to mention that anyway, just to be aware of that.
>
> Best,
>
> Dawid
>
> On 30/08/2019 14:48, Stephan Ewen wrote:
> > +1 in general
> >
> > What is the default in batch, though? No restarts? I always found that
> > somewhat uncommon.
> > Should we also change that part, if we are changing the default anyways?
> >
> >
> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <trohrm...@apache.org>
> wrote:
> >
> >> Hi everyone,
> >>
> >> I wanted to discuss how to simplify Flink's cluster level
> RestartStrategy
> >> configuration [1]. Currently, Flink's behaviour with respect to
> configuring
> >> the {{RestartStrategies}} is quite complicated and convoluted. The
> reason
> >> for this is that we evolved the way it has been configured and wanted to
> >> keep it backwards compatible. Due to this, we have currently the
> following
> >> behaviour:
> >>
> >> * If the config option `restart-strategy` is configured, then Flink uses
> >> this `RestartStrategy` (so far so simple)
> >> * If the config option `restart-strategy` is not configured, then
> >> ** If `restart-strategy.fixed-delay.attempts` or
> >> `restart-strategy.fixed-delay.delay` are defined, then instantiate
> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
> >> restart-strategy.fixed-delay.delay)`
> >> ** If `restart-strategy.fixed-delay.attempts` and
> >> `restart-strategy.fixed-delay.delay` are not defined, then
> >> *** If checkpointing is disabled, then choose `NoRestartStrategy`
> >> *** If checkpointing is enabled, then choose
> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> >>
> >> I would like to simplify the configuration by removing the "If
> >> `restart-strategy.fixed-delay.attempts` or
> >> `restart-strategy.fixed-delay.delay`, then" condition. That way, the
> logic
> >> would be the following:
> >>
> >> * If the config option `restart-strategy` is configured, then Flink uses
> >> this `RestartStrategy`
> >> * If the config option `restart-strategy` is not configured, then
> >> ** If checkpointing is disabled, then choose `NoRestartStrategy`
> >> ** If checkpointing is enabled, then choose
> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> >>
> >> That way we retain the user friendliness that jobs restart if the user
> >> enabled checkpointing and we make it clear that any `
> >> restart-strategy.fixed-delay.xyz` setting will only be respected if
> >> `restart-strategy` has been set to `fixed-delay`.
> >>
> >> This simplification would, however, change Flink's behaviour and might
> >> break existing setups. Since we introduced `RestartStrategies` with
> Flink
> >> 1.0.0 and deprecated the prior configuration mechanism which enables
> >> restarting if either the `attempts` or the `delay` has been set, I think
> >> that the number of broken jobs should be minimal if not non-existent.
> >>
> >> I'm sure that one can simplify the way RestartStrategies are
> >> programmatically configured as well but for the sake of
> simplicity/scoping
> >> I'd like to not touch it right away.
> >>
> >> What do you think about this behaviour change?
> >>
> >> [1] https://issues.apache.org/jira/browse/FLINK-13921
> >>
> >> Cheers,
> >> Till
> >>
>
>

Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Reply via email to