The link to the dev ML discussion about FLIP-61 is https://lists.apache.org/thread.html/e206390127bcbd9b24d9c41a838faa75157e468e01552ad241e3e24b@%3Cdev.flink.apache.org%3E
Cheers, Till On Mon, Sep 2, 2019 at 10:37 PM Till Rohrmann <trohrm...@apache.org> wrote: > Thanks a lot for the positive feedback. I think you are right Becket that > this needs a FLIP since it changes Flink's behaviour. I'll create one and > post it to the dev ML. > > @Zhu Zhu <reed...@gmail.com> I agree that the restart delay is related > to the RestartStrategy configuration. However, I would like to exclude it > from this improvement proposal since it would broaden the scope > unnecessarily. I'd suggest to continue the discussion about this on the > SURVEY thread and then based on the outcome start a DISCUSS thread about > the concrete changes to the default restart delay value. > > Since I will create a FLIP for the simplification logic, I'll conclude > this thread and would encourage everyone to move all further discussion to > the Flip DISCUSS thread. I'll post the link to it shortly. > > Cheers, > Till > > On Mon, Sep 2, 2019 at 6:22 AM zhijiang <wangzhijiang...@aliyun.com.invalid> > wrote: > >> +1 for this proposal. >> >> IMO, it not only simplifies the cluster configuration, but also seems >> more fit logic to not rely on some low-level speicific parameters to judge >> the upper-level strategy. >> It is also resonable to push forward the restart strategy configuration >> step by step for batch later. >> >> Best, >> Zhijiang >> ------------------------------------------------------------------ >> From:Zhu Zhu <reed...@gmail.com> >> Send Time:2019年9月2日(星期一) 05:18 >> To:dev <dev@flink.apache.org> >> Subject:Re: [DISCUSS] Simplify Flink's cluster level RestartStrategy >> configuration >> >> +1 to simplify the RestartStrategy configuration >> >> One thing to confirm is whether the default delay should be "0 s" in the >> case of >> "If the config option `restart-strategy` is not configured" and "If >> checkpointing is enabled". >> I see a related discussion([SURVEY] Is the default restart delay of 0s >> causing problems) is ongoing and we may need to take the result from that. >> >> Thanks, >> Zhu Zhu >> >> Becket Qin <becket....@gmail.com> 于2019年9月2日周一 上午9:06写道: >> >> > +1. The new behavior makes sense to me. >> > >> > BTW, we need a FLIP for this :) >> > >> > On Fri, Aug 30, 2019 at 10:17 PM Till Rohrmann <trohrm...@apache.org> >> > wrote: >> > >> > > After an offline discussion with Stephan, we concluded that changing >> the >> > > default restart strategy for batch jobs is not that easy because the >> > > cluster level restart configuration does not necessarily know about >> the >> > > type of job which is submitted. We concluded that we would like to >> keep >> > the >> > > batch behaviour as is (NoRestartStrategy) and revisit this issue at a >> > later >> > > point in time. >> > > >> > > On Fri, Aug 30, 2019 at 3:24 PM Till Rohrmann <trohrm...@apache.org> >> > > wrote: >> > > >> > > > The current default behaviour for batch is `NoRestartStrategy` if >> > nothing >> > > > is configured. We could say that we set the default value of >> > > > `restart-strategy` to `FixedDelayRestartStrategy(Integer.MAX_VALUE, >> "0 >> > > s")` >> > > > independent of the checkpointing. The only downside I could see is >> that >> > > > some faulty batch jobs might get stuck in a restart loop without >> > > reaching a >> > > > terminal state. >> > > > >> > > > @Dawid, I don't intend to touch the ExecutionConfig. This change >> only >> > > > targets the cluster level configuration of the RestartStrategy. >> > > > >> > > > Cheers, >> > > > Till >> > > > >> > > > On Fri, Aug 30, 2019 at 3:14 PM Dawid Wysakowicz < >> > dwysakow...@apache.org >> > > > >> > > > wrote: >> > > > >> > > >> Also +1 in general. >> > > >> >> > > >> I have a few questions though: >> > > >> >> > > >> - does it only apply to the logic in >> > > >> >> > > >> >> > > >> > >> org.apache.flink.runtime.executiongraph.restart.RestartStrategyFactory#createRestartStrategyFactory, >> > > >> which is only the cluster side configuration? Or do you want to >> change >> > > >> the logic also on the job side in ExecutionConfig? >> > > >> >> > > >> - if the latter, does that mean deprecated methods in >> ExecutionConfig >> > > >> like: setNumberOfExecutionRetries, setExecutionRetryDelay will >> have no >> > > >> effect? I think this would be a good idea, but would suggest to >> remove >> > > >> the corresponding fields and methods. This is not that simple >> though. >> > I >> > > >> tried to do that for other parameters that have no effect already >> like >> > > >> codeAnalysisMode & failTaskOnCheckpointError. The are two problems: >> > > >> >> > > >> 1) setNumberOfExecutionRetires are effectively marked with >> @Public >> > > >> annotation (the codeAnalysisMode & failTaskOnCheckpointError don't >> > have >> > > >> this problem). Therefore this would be a binary incompatible >> change. >> > > >> >> > > >> 2) ExecutionConfig is stored in state as part of >> PojoSerializer in >> > > >> pre flink 1.7. It should not be a problem for >> > numberOfExecutionRetries & >> > > >> executionRetryDelays as they are of primitive types. It is a >> problem >> > for >> > > >> codeAnalysisMode (we cannot remove the class, as this breaks >> > > >> serialization). I wanted to mention that anyway, just to be aware >> of >> > > that. >> > > >> >> > > >> Best, >> > > >> >> > > >> Dawid >> > > >> >> > > >> On 30/08/2019 14:48, Stephan Ewen wrote: >> > > >> > +1 in general >> > > >> > >> > > >> > What is the default in batch, though? No restarts? I always found >> > that >> > > >> > somewhat uncommon. >> > > >> > Should we also change that part, if we are changing the default >> > > anyways? >> > > >> > >> > > >> > >> > > >> > On Fri, Aug 30, 2019 at 2:35 PM Till Rohrmann < >> trohrm...@apache.org >> > > >> > > >> wrote: >> > > >> > >> > > >> >> Hi everyone, >> > > >> >> >> > > >> >> I wanted to discuss how to simplify Flink's cluster level >> > > >> RestartStrategy >> > > >> >> configuration [1]. Currently, Flink's behaviour with respect to >> > > >> configuring >> > > >> >> the {{RestartStrategies}} is quite complicated and convoluted. >> The >> > > >> reason >> > > >> >> for this is that we evolved the way it has been configured and >> > wanted >> > > >> to >> > > >> >> keep it backwards compatible. Due to this, we have currently the >> > > >> following >> > > >> >> behaviour: >> > > >> >> >> > > >> >> * If the config option `restart-strategy` is configured, then >> Flink >> > > >> uses >> > > >> >> this `RestartStrategy` (so far so simple) >> > > >> >> * If the config option `restart-strategy` is not configured, >> then >> > > >> >> ** If `restart-strategy.fixed-delay.attempts` or >> > > >> >> `restart-strategy.fixed-delay.delay` are defined, then >> instantiate >> > > >> >> >> `FixedDelayRestartStrategy(restart-strategy.fixed-delay.attempts, >> > > >> >> restart-strategy.fixed-delay.delay)` >> > > >> >> ** If `restart-strategy.fixed-delay.attempts` and >> > > >> >> `restart-strategy.fixed-delay.delay` are not defined, then >> > > >> >> *** If checkpointing is disabled, then choose >> `NoRestartStrategy` >> > > >> >> *** If checkpointing is enabled, then choose >> > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` >> > > >> >> >> > > >> >> I would like to simplify the configuration by removing the "If >> > > >> >> `restart-strategy.fixed-delay.attempts` or >> > > >> >> `restart-strategy.fixed-delay.delay`, then" condition. That way, >> > the >> > > >> logic >> > > >> >> would be the following: >> > > >> >> >> > > >> >> * If the config option `restart-strategy` is configured, then >> Flink >> > > >> uses >> > > >> >> this `RestartStrategy` >> > > >> >> * If the config option `restart-strategy` is not configured, >> then >> > > >> >> ** If checkpointing is disabled, then choose `NoRestartStrategy` >> > > >> >> ** If checkpointing is enabled, then choose >> > > >> >> `FixedDelayRestartStrategy(Integer.MAX_VALUE, "0 s")` >> > > >> >> >> > > >> >> That way we retain the user friendliness that jobs restart if >> the >> > > user >> > > >> >> enabled checkpointing and we make it clear that any ` >> > > >> >> restart-strategy.fixed-delay.xyz` setting will only be >> respected >> > if >> > > >> >> `restart-strategy` has been set to `fixed-delay`. >> > > >> >> >> > > >> >> This simplification would, however, change Flink's behaviour and >> > > might >> > > >> >> break existing setups. Since we introduced `RestartStrategies` >> with >> > > >> Flink >> > > >> >> 1.0.0 and deprecated the prior configuration mechanism which >> > enables >> > > >> >> restarting if either the `attempts` or the `delay` has been >> set, I >> > > >> think >> > > >> >> that the number of broken jobs should be minimal if not >> > non-existent. >> > > >> >> >> > > >> >> I'm sure that one can simplify the way RestartStrategies are >> > > >> >> programmatically configured as well but for the sake of >> > > >> simplicity/scoping >> > > >> >> I'd like to not touch it right away. >> > > >> >> >> > > >> >> What do you think about this behaviour change? >> > > >> >> >> > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-13921 >> > > >> >> >> > > >> >> Cheers, >> > > >> >> Till >> > > >> >> >> > > >> >> > > >> >> > > >> > >> >>