Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Till Rohrmann Mon, 02 Sep 2019 13:38:25 -0700

Thanks a lot for the positive feedback. I think you are right Becket that
this needs a FLIP since it changes Flink's behaviour. I'll create one and
post it to the dev ML.


@Zhu Zhu <reed...@gmail.com>  I agree that the restart delay is related to
the RestartStrategy configuration. However, I would like to exclude it from
this improvement proposal since it would broaden the scope unnecessarily.
I'd suggest to continue the discussion about this on the SURVEY thread and
then based on the outcome start a DISCUSS thread about the concrete changes
to the default restart delay value.

Since I will create a FLIP for the simplification logic, I'll conclude this
thread and would encourage everyone to move all further discussion to the
Flip DISCUSS thread. I'll post the link to it shortly.

Cheers,
Till

On Mon, Sep 2, 2019 at 6:22 AM zhijiang <wangzhijiang...@aliyun.com.invalid>
wrote:

> +1 for this proposal.
>
> IMO, it not only simplifies the cluster configuration, but also seems more
> fit logic to not rely on some low-level speicific parameters to judge the
> upper-level strategy.
> It is also resonable to push forward the restart strategy configuration
> step by step for batch later.
>
> Best,
> Zhijiang
> ------------------------------------------------------------------
> From:Zhu Zhu <reed...@gmail.com>
> Send Time:2019年9月2日(星期一) 05:18
> To:dev <dev@flink.apache.org>
> Subject:Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy
> configuration
>
> +1 to simplify the RestartStrategy configuration
>
> One thing to confirm is whether the default delay should be "0 s" in the
> case of
> "If the config option `restart-strategy` is not configured" and "If
> checkpointing is enabled".
> I see a related discussion([SURVEY] Is the default restart delay of 0s
> causing problems) is ongoing and we may need to take the result from that.
>
> Thanks,
> Zhu Zhu
>
> Becket Qin <becket....@gmail.com> 于2019年9月2日周一 上午9:06写道：
>
> > +1. The new behavior makes sense to me.
> >
> > BTW, we need a FLIP for this :)
> >
> > On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <trohrm...@apache.org>
> > wrote:
> >
> > > After an offline discussion with Stephan, we concluded that changing
> the
> > > default restart strategy for batch jobs is not that easy because the
> > > cluster level restart configuration does not necessarily know about the
> > > type of job which is submitted. We concluded that we would like to keep
> > the
> > > batch behaviour as is (NoRestartStrategy) and revisit this issue at a
> > later
> > > point in time.
> > >
> > > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <trohrm...@apache.org>
> > > wrote:
> > >
> > > > The current default behaviour for batch is `NoRestartStrategy` if
> > nothing
> > > > is configured. We could say that we set the default value of
> > > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE,
> "0
> > > s")`
> > > > independent of the checkpointing. The only downside I could see is
> that
> > > > some faulty batch jobs might get stuck in a restart loop without
> > > reaching a
> > > > terminal state.
> > > >
> > > > @Dawid, I don't intend to touch the ExecutionConfig. This change only
> > > > targets the cluster level configuration of the RestartStrategy.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <
> > dwysakow...@apache.org
> > > >
> > > > wrote:
> > > >
> > > >> Also +1 in general.
> > > >>
> > > >> I have a few questions though:
> > > >>
> > > >> - does it only apply to the logic in
> > > >>
> > > >>
> > >
> >
> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory,
> > > >> which is only the cluster side configuration? Or do you want to
> change
> > > >> the logic also on the job side in ExecutionConfig?
> > > >>
> > > >> - if the latter, does that mean deprecated methods in
> ExecutionConfig
> > > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will have
> no
> > > >> effect? I think this would be a good idea, but would suggest to
> remove
> > > >> the corresponding fields and methods. This is not that simple
> though.
> > I
> > > >> tried to do that for other parameters that have no effect already
> like
> > > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems:
> > > >>
> > > >>     1) setNumberOfExecutionRetires are effectively marked with
> @Public
> > > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't
> > have
> > > >> this problem). Therefore this would be a binary incompatible change.
> > > >>
> > > >>     2) ExecutionConfig is stored in state as part of PojoSerializer
> in
> > > >> pre flink 1.7. It should not be a problem for
> > numberOfExecutionRetries &
> > > >> executionRetryDelays as they are of primitive types. It is a problem
> > for
> > > >> codeAnalysisMode (we cannot remove the class, as this breaks
> > > >> serialization). I wanted to mention that anyway, just to be aware of
> > > that.
> > > >>
> > > >> Best,
> > > >>
> > > >> Dawid
> > > >>
> > > >> On 30/08/2019 14:48, Stephan Ewen wrote:
> > > >> > +1 in general
> > > >> >
> > > >> > What is the default in batch, though? No restarts? I always found
> > that
> > > >> > somewhat uncommon.
> > > >> > Should we also change that part, if we are changing the default
> > > anyways?
> > > >> >
> > > >> >
> > > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <
> trohrm...@apache.org
> > >
> > > >> wrote:
> > > >> >
> > > >> >> Hi everyone,
> > > >> >>
> > > >> >> I wanted to discuss how to simplify Flink's cluster level
> > > >> RestartStrategy
> > > >> >> configuration [1]. Currently, Flink's behaviour with respect to
> > > >> configuring
> > > >> >> the {{RestartStrategies}} is quite complicated and convoluted.
> The
> > > >> reason
> > > >> >> for this is that we evolved the way it has been configured and
> > wanted
> > > >> to
> > > >> >> keep it backwards compatible. Due to this, we have currently the
> > > >> following
> > > >> >> behaviour:
> > > >> >>
> > > >> >> * If the config option `restart-strategy` is configured, then
> Flink
> > > >> uses
> > > >> >> this `RestartStrategy` (so far so simple)
> > > >> >> * If the config option `restart-strategy` is not configured, then
> > > >> >> ** If `restart-strategy.fixed-delay.attempts` or
> > > >> >> `restart-strategy.fixed-delay.delay` are defined, then
> instantiate
> > > >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
> > > >> >> restart-strategy.fixed-delay.delay)`
> > > >> >> ** If `restart-strategy.fixed-delay.attempts` and
> > > >> >> `restart-strategy.fixed-delay.delay` are not defined, then
> > > >> >> *** If checkpointing is disabled, then choose `NoRestartStrategy`
> > > >> >> *** If checkpointing is enabled, then choose
> > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> > > >> >>
> > > >> >> I would like to simplify the configuration by removing the "If
> > > >> >> `restart-strategy.fixed-delay.attempts` or
> > > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way,
> > the
> > > >> logic
> > > >> >> would be the following:
> > > >> >>
> > > >> >> * If the config option `restart-strategy` is configured, then
> Flink
> > > >> uses
> > > >> >> this `RestartStrategy`
> > > >> >> * If the config option `restart-strategy` is not configured, then
> > > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy`
> > > >> >> ** If checkpointing is enabled, then choose
> > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
> > > >> >>
> > > >> >> That way we retain the user friendliness that jobs restart if the
> > > user
> > > >> >> enabled checkpointing and we make it clear that any `
> > > >> >> restart-strategy.fixed-delay.xyz` setting will only be respected
> > if
> > > >> >> `restart-strategy` has been set to `fixed-delay`.
> > > >> >>
> > > >> >> This simplification would, however, change Flink's behaviour and
> > > might
> > > >> >> break existing setups. Since we introduced `RestartStrategies`
> with
> > > >> Flink
> > > >> >> 1.0.0 and deprecated the prior configuration mechanism which
> > enables
> > > >> >> restarting if either the `attempts` or the `delay` has been set,
> I
> > > >> think
> > > >> >> that the number of broken jobs should be minimal if not
> > non-existent.
> > > >> >>
> > > >> >> I'm sure that one can simplify the way RestartStrategies are
> > > >> >> programmatically configured as well but for the sake of
> > > >> simplicity/scoping
> > > >> >> I'd like to not touch it right away.
> > > >> >>
> > > >> >> What do you think about this behaviour change?
> > > >> >>
> > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921
> > > >> >>
> > > >> >> Cheers,
> > > >> >> Till
> > > >> >>
> > > >>
> > > >>
> > >
> >
>
>

Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy configuration

Reply via email to