The link to the dev ML discussion about FLIP-61 is
https://lists.apache.org/thread.html/e206390127bcbd9b24d9c41a838faa75157e468e01552ad241e3e24b@%3Cdev.flink.apache.org%3E

Cheers,
Till

On Mon, Sep 2, 2019 at 10:37 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Thanks a lot for the positive feedback. I think you are right Becket that
> this needs a FLIP since it changes Flink's behaviour. I'll create one and
> post it to the dev ML.
>
> @Zhu Zhu <reed...@gmail.com>  I agree that the restart delay is related
> to the RestartStrategy configuration. However, I would like to exclude it
> from this improvement proposal since it would broaden the scope
> unnecessarily. I'd suggest to continue the discussion about this on the
> SURVEY thread and then based on the outcome start a DISCUSS thread about
> the concrete changes to the default restart delay value.
>
> Since I will create a FLIP for the simplification logic, I'll conclude
> this thread and would encourage everyone to move all further discussion to
> the Flip DISCUSS thread. I'll post the link to it shortly.
>
> Cheers,
> Till
>
> On Mon, Sep 2, 2019 at 6:22 AM zhijiang <wangzhijiang...@aliyun.com.invalid>
> wrote:
>
>> +1 for this proposal.
>>
>> IMO, it not only simplifies the cluster configuration, but also seems
>> more fit logic to not rely on some low-level speicific parameters to judge
>> the upper-level strategy.
>> It is also resonable to push forward the restart strategy configuration
>> step by step for batch later.
>>
>> Best,
>> Zhijiang
>> ------------------------------------------------------------------
>> From:Zhu Zhu <reed...@gmail.com>
>> Send Time:2019年9月2日(星期一) 05:18
>> To:dev <dev@flink.apache.org>
>> Subject:Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy
>> configuration
>>
>> +1 to simplify the RestartStrategy configuration
>>
>> One thing to confirm is whether the default delay should be "0 s" in the
>> case of
>> "If the config option `restart-strategy` is not configured" and "If
>> checkpointing is enabled".
>> I see a related discussion([SURVEY] Is the default restart delay of 0s
>> causing problems) is ongoing and we may need to take the result from that.
>>
>> Thanks,
>> Zhu Zhu
>>
>> Becket Qin <becket....@gmail.com> 于2019年9月2日周一 上午9:06写道:
>>
>> > +1. The new behavior makes sense to me.
>> >
>> > BTW, we need a FLIP for this :)
>> >
>> > On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <trohrm...@apache.org>
>> > wrote:
>> >
>> > > After an offline discussion with Stephan, we concluded that changing
>> the
>> > > default restart strategy for batch jobs is not that easy because the
>> > > cluster level restart configuration does not necessarily know about
>> the
>> > > type of job which is submitted. We concluded that we would like to
>> keep
>> > the
>> > > batch behaviour as is (NoRestartStrategy) and revisit this issue at a
>> > later
>> > > point in time.
>> > >
>> > > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <trohrm...@apache.org>
>> > > wrote:
>> > >
>> > > > The current default behaviour for batch is `NoRestartStrategy` if
>> > nothing
>> > > > is configured. We could say that we set the default value of
>> > > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE,
>> "0
>> > > s")`
>> > > > independent of the checkpointing. The only downside I could see is
>> that
>> > > > some faulty batch jobs might get stuck in a restart loop without
>> > > reaching a
>> > > > terminal state.
>> > > >
>> > > > @Dawid, I don't intend to touch the ExecutionConfig. This change
>> only
>> > > > targets the cluster level configuration of the RestartStrategy.
>> > > >
>> > > > Cheers,
>> > > > Till
>> > > >
>> > > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz <
>> > dwysakow...@apache.org
>> > > >
>> > > > wrote:
>> > > >
>> > > >> Also +1 in general.
>> > > >>
>> > > >> I have a few questions though:
>> > > >>
>> > > >> - does it only apply to the logic in
>> > > >>
>> > > >>
>> > >
>> >
>> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory,
>> > > >> which is only the cluster side configuration? Or do you want to
>> change
>> > > >> the logic also on the job side in ExecutionConfig?
>> > > >>
>> > > >> - if the latter, does that mean deprecated methods in
>> ExecutionConfig
>> > > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will
>> have no
>> > > >> effect? I think this would be a good idea, but would suggest to
>> remove
>> > > >> the corresponding fields and methods. This is not that simple
>> though.
>> > I
>> > > >> tried to do that for other parameters that have no effect already
>> like
>> > > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems:
>> > > >>
>> > > >>     1) setNumberOfExecutionRetires are effectively marked with
>> @Public
>> > > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't
>> > have
>> > > >> this problem). Therefore this would be a binary incompatible
>> change.
>> > > >>
>> > > >>     2) ExecutionConfig is stored in state as part of
>> PojoSerializer in
>> > > >> pre flink 1.7. It should not be a problem for
>> > numberOfExecutionRetries &
>> > > >> executionRetryDelays as they are of primitive types. It is a
>> problem
>> > for
>> > > >> codeAnalysisMode (we cannot remove the class, as this breaks
>> > > >> serialization). I wanted to mention that anyway, just to be aware
>> of
>> > > that.
>> > > >>
>> > > >> Best,
>> > > >>
>> > > >> Dawid
>> > > >>
>> > > >> On 30/08/2019 14:48, Stephan Ewen wrote:
>> > > >> > +1 in general
>> > > >> >
>> > > >> > What is the default in batch, though? No restarts? I always found
>> > that
>> > > >> > somewhat uncommon.
>> > > >> > Should we also change that part, if we are changing the default
>> > > anyways?
>> > > >> >
>> > > >> >
>> > > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann <
>> trohrm...@apache.org
>> > >
>> > > >> wrote:
>> > > >> >
>> > > >> >> Hi everyone,
>> > > >> >>
>> > > >> >> I wanted to discuss how to simplify Flink's cluster level
>> > > >> RestartStrategy
>> > > >> >> configuration [1]. Currently, Flink's behaviour with respect to
>> > > >> configuring
>> > > >> >> the {{RestartStrategies}} is quite complicated and convoluted.
>> The
>> > > >> reason
>> > > >> >> for this is that we evolved the way it has been configured and
>> > wanted
>> > > >> to
>> > > >> >> keep it backwards compatible. Due to this, we have currently the
>> > > >> following
>> > > >> >> behaviour:
>> > > >> >>
>> > > >> >> * If the config option `restart-strategy` is configured, then
>> Flink
>> > > >> uses
>> > > >> >> this `RestartStrategy` (so far so simple)
>> > > >> >> * If the config option `restart-strategy` is not configured,
>> then
>> > > >> >> ** If `restart-strategy.fixed-delay.attempts` or
>> > > >> >> `restart-strategy.fixed-delay.delay` are defined, then
>> instantiate
>> > > >> >>
>> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts,
>> > > >> >> restart-strategy.fixed-delay.delay)`
>> > > >> >> ** If `restart-strategy.fixed-delay.attempts` and
>> > > >> >> `restart-strategy.fixed-delay.delay` are not defined, then
>> > > >> >> *** If checkpointing is disabled, then choose
>> `NoRestartStrategy`
>> > > >> >> *** If checkpointing is enabled, then choose
>> > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
>> > > >> >>
>> > > >> >> I would like to simplify the configuration by removing the "If
>> > > >> >> `restart-strategy.fixed-delay.attempts` or
>> > > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way,
>> > the
>> > > >> logic
>> > > >> >> would be the following:
>> > > >> >>
>> > > >> >> * If the config option `restart-strategy` is configured, then
>> Flink
>> > > >> uses
>> > > >> >> this `RestartStrategy`
>> > > >> >> * If the config option `restart-strategy` is not configured,
>> then
>> > > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy`
>> > > >> >> ** If checkpointing is enabled, then choose
>> > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")`
>> > > >> >>
>> > > >> >> That way we retain the user friendliness that jobs restart if
>> the
>> > > user
>> > > >> >> enabled checkpointing and we make it clear that any `
>> > > >> >> restart-strategy.fixed-delay.xyz` setting will only be
>> respected
>> > if
>> > > >> >> `restart-strategy` has been set to `fixed-delay`.
>> > > >> >>
>> > > >> >> This simplification would, however, change Flink's behaviour and
>> > > might
>> > > >> >> break existing setups. Since we introduced `RestartStrategies`
>> with
>> > > >> Flink
>> > > >> >> 1.0.0 and deprecated the prior configuration mechanism which
>> > enables
>> > > >> >> restarting if either the `attempts` or the `delay` has been
>> set, I
>> > > >> think
>> > > >> >> that the number of broken jobs should be minimal if not
>> > non-existent.
>> > > >> >>
>> > > >> >> I'm sure that one can simplify the way RestartStrategies are
>> > > >> >> programmatically configured as well but for the sake of
>> > > >> simplicity/scoping
>> > > >> >> I'd like to not touch it right away.
>> > > >> >>
>> > > >> >> What do you think about this behaviour change?
>> > > >> >>
>> > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921
>> > > >> >>
>> > > >> >> Cheers,
>> > > >> >> Till
>> > > >> >>
>> > > >>
>> > > >>
>> > >
>> >
>>
>>

Reply via email to