Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Zhu Zhu Sun, 01 Sep 2019 20:19:10 -0700

+1 to simplify the RestartStrategy configuration

One thing to confirm is whether the default delay should be "0 s" in the
case of
"If the config option `restart-strategy` is not configured" and "If
checkpointing is enabled".
I see a related discussion([SURVEY] Is the default restart delay of 0s
causing problems) is ongoing and we may need to take the result from that.


Thanks,
Zhu Zhu

Becket Qin <[email protected]> 于2019年9月2日周一 上午9:06写道：

> +1. The new behavior makes sense to me.
>
> BTW, we need a FLIP for this :)
>
> On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <[email protected]>
> wrote:
>
> > After an offline discussion with Stephan, we concluded that changing the
> > default restart strategy for batch jobs is not that easy because the
> > cluster level restart configuration does not necessarily know about the
> > type of job which is submitted. We concluded that we would like to keep
> the
> > batch behaviour as is (NoRestartStrategy) and revisit this issue at a
> later
> > point in time.
> >
> > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <[email protected]>
> > wrote:
> >
> > > The current default behaviour for batch is `NoRestartStrategy` if
> nothing
> > > is configured. We could say that we set the default value of
> > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0
> > s")`
> > > independent of the checkpointing. The only downside I could see is that
> > > some faulty batch jobs might get stuck in a restart loop without
> > reaching a
> > > terminal state.
> > >
> > > @Dawid, I don't intend to touch the ExecutionConfig. This change only
> > > targets the cluster level configuration of the RestartStrategy.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <
> [email protected]
> > >
> > > wrote:
> > >
> > >> Also +1 in general.
> > >>
> > >> I have a few questions though:
> > >>
> > >> - does it only apply to the logic in
> > >>
> > >>
> >
> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory,
> > >> which is only the cluster side configuration? Or do you want to change
> > >> the logic also on the job side in ExecutionConfig?
> > >>
> > >> - if the latter, does that mean deprecated methods in ExecutionConfig
> > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have no
> > >> effect? I think this would be a good idea, but would suggest to remove
> > >> the corresponding fields and methods. This is not that simple though.
> I
> > >> tried to do that for other parameters that have no effect already like
> > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems:
> > >>
> > >>     1) setNumberOfExecutionRetires are effectively marked with @Public
> > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't
> have
> > >> this problem). Therefore this would be a binary incompatible change.
> > >>
> > >>     2) ExecutionConfig is stored in state as part of PojoSerializer in
> > >> pre flink 1.7. It should not be a problem for
> numberOfExecutionRetries &
> > >> executionRetryDelays as they are of primitive types. It is a problem
> for
> > >> codeAnalysisMode (we cannot remove the class, as this breaks
> > >> serialization). I wanted to mention that anyway, just to be aware of
> > that.
> > >>
> > >> Best,
> > >>
> > >> Dawid
> > >>
> > >> On 30/08/2019 14:48, Stephan Ewen wrote:
> > >> > +1 in general
> > >> >
> > >> > What is the default in batch, though? No restarts? I always found
> that
> > >> > somewhat uncommon.
> > >> > Should we also change that part, if we are changing the default
> > anyways?
> > >> >
> > >> >
> > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <[email protected]
> >
> > >> wrote:
> > >> >
> > >> >> Hi everyone,
> > >> >>
> > >> >> I wanted to discuss how to simplify Flink's cluster level
> > >> RestartStrategy
> > >> >> configuration [1]. Currently, Flink's behaviour with respect to
> > >> configuring
> > >> >> the {{RestartStrategies}} is quite complicated and convoluted. The
> > >> reason
> > >> >> for this is that we evolved the way it has been configured and
> wanted
> > >> to
> > >> >> keep it backwards compatible. Due to this, we have currently the
> > >> following
> > >> >> behaviour:
> > >> >>
> > >> >> * If the config option `restart-strategy` is configured, then Flink
> > >> uses
> > >> >> this `RestartStrategy` (so far so simple)
> > >> >> * If the config option `restart-strategy` is not configured, then
> > >> >> ** If `restart-strategy.fixed-delay.attempts` or
> > >> >> `restart-strategy.fixed-delay.delay` are defined, then instantiate
> > >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
> > >> >> restart-strategy.fixed-delay.delay)`
> > >> >> ** If `restart-strategy.fixed-delay.attempts` and
> > >> >> `restart-strategy.fixed-delay.delay` are not defined, then
> > >> >> *** If checkpointing is disabled, then choose `NoRestartStrategy`
> > >> >> *** If checkpointing is enabled, then choose
> > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> > >> >>
> > >> >> I would like to simplify the configuration by removing the "If
> > >> >> `restart-strategy.fixed-delay.attempts` or
> > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way,
> the
> > >> logic
> > >> >> would be the following:
> > >> >>
> > >> >> * If the config option `restart-strategy` is configured, then Flink
> > >> uses
> > >> >> this `RestartStrategy`
> > >> >> * If the config option `restart-strategy` is not configured, then
> > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy`
> > >> >> ** If checkpointing is enabled, then choose
> > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> > >> >>
> > >> >> That way we retain the user friendliness that jobs restart if the
> > user
> > >> >> enabled checkpointing and we make it clear that any `
> > >> >> restart-strategy.fixed-delay.xyz` setting will only be respected
> if
> > >> >> `restart-strategy` has been set to `fixed-delay`.
> > >> >>
> > >> >> This simplification would, however, change Flink's behaviour and
> > might
> > >> >> break existing setups. Since we introduced `RestartStrategies` with
> > >> Flink
> > >> >> 1.0.0 and deprecated the prior configuration mechanism which
> enables
> > >> >> restarting if either the `attempts` or the `delay` has been set, I
> > >> think
> > >> >> that the number of broken jobs should be minimal if not
> non-existent.
> > >> >>
> > >> >> I'm sure that one can simplify the way RestartStrategies are
> > >> >> programmatically configured as well but for the sake of
> > >> simplicity/scoping
> > >> >> I'd like to not touch it right away.
> > >> >>
> > >> >> What do you think about this behaviour change?
> > >> >>
> > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921
> > >> >>
> > >> >> Cheers,
> > >> >> Till
> > >> >>
> > >>
> > >>
> >
>

Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Reply via email to