We will then keep the decision not to support customized restart strategies in Flink 1.10.
Thanks Steven for the inputs!

Thanks,
Zhu Zhu

On Thu, Sep 26, 2019 at 12:13 AM Steven Wu <stevenz...@gmail.com> wrote:

> Zhu Zhu, that is correct.
>
> On Tue, Sep 24, 2019 at 8:04 PM Zhu Zhu <reed...@gmail.com> wrote:
>
>> Hi Steven,
>>
>> As a conclusion, since we will have a meter metric[1] for restarts,
>> customized restart strategy is not needed in your case.
>> Is that right?
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-14164
>>
>> Thanks,
>> Zhu Zhu
>>
>> On Wed, Sep 25, 2019 at 2:30 AM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> Zhu Zhu,
>>>
>>> Sorry, I was using different terminology. Yes, Flink meter is what I was
>>> talking about regarding "fullRestarts" for threshold based alerting.
>>>
>>> On Mon, Sep 23, 2019 at 7:46 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>
>>>> Steven,
>>>>
>>>> In my mind, Flink counter only stores its accumulated count and reports
>>>> that value. Are you using an external counter directly?
>>>> Maybe Flink Meter/MeterView is what you need? It stores the count and
>>>> calculates the rate. And it will report its "count" as well as "rate" to
>>>> external metric services.
>>>>
>>>> The counter "task_failures" only works if the individual failover
>>>> strategy is enabled. However, it is not a public interface and is not
>>>> suggested to use, as the fine grained recovery (region failover) now
>>>> supersedes it.
>>>> I've opened a ticket[1] to add a metric to show failovers that respects
>>>> fine grained recovery.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-14164
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> On Tue, Sep 24, 2019 at 6:41 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>
>>>>> When we set up an alert like "fullRestarts > 1" for some rolling window,
>>>>> we want to use a counter. If it is a Gauge, "fullRestarts" will never go
>>>>> below 1 after a first full restart, so the alert condition will always be
>>>>> true after the first job restart. If we can apply a derivative to the
>>>>> Gauge value, I guess the alert can probably work.
>>>>> I can explore if that is an option or not.
>>>>>
>>>>> Yeah. Understood that "fullRestart" won't increment when fine grained
>>>>> recovery happens. I think the "task_failures" counter already exists in
>>>>> Flink.
>>>>>
>>>>> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>
>>>>>> Steven,
>>>>>>
>>>>>> Thanks for the information. If we can determine this is a common issue,
>>>>>> we can solve it in Flink core.
>>>>>> To get to that state, I have two questions which need your help:
>>>>>> 1. Why is gauge not good for alerting? The metric "fullRestart" is a
>>>>>> Gauge<Long>. Does the metric reporter you use report Counter and
>>>>>> Gauge<Long> to external services in different ways? Or can anything
>>>>>> else be different due to the metric type?
>>>>>> 2. Is the "number of restarts" what you actually need, rather than
>>>>>> the "fullRestart" count? If so, I believe we will have such a counter
>>>>>> metric in 1.10, since the previous "fullRestart" metric value is not the
>>>>>> number of restarts when fine grained recovery (feature added in 1.9.0)
>>>>>> is enabled.
>>>>>> "fullRestart" reveals how many times the entire job graph has been
>>>>>> restarted. If fine grained recovery (feature added in 1.9.0) is enabled,
>>>>>> the graph would not be restarted when task failures happen and the
>>>>>> "fullRestart" value will not increment in such cases.
>>>>>>
>>>>>> I'd appreciate it if you can help with these questions so we can make
>>>>>> better decisions for Flink.
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>
>>>>>>> Zhu Zhu,
>>>>>>>
>>>>>>> Flink fullRestart metric is a Gauge, which is not good for alerting
>>>>>>> on. We publish an equivalent Counter metric for alerting purposes.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Steven
>>>>>>>
>>>>>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Steven for the feedback!
>>>>>>>> Could you share more information about the metrics you add in your
>>>>>>>> customized restart strategy?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Zhu
>>>>>>>>
>>>>>>>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> We do use config like
>>>>>>>>> "restart-strategy: org.foobar.MyRestartStrategyFactoryFactory",
>>>>>>>>> mainly to add metrics beyond the ones Flink provides.
>>>>>>>>>
>>>>>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks everyone for the input.
>>>>>>>>>>
>>>>>>>>>> The RestartStrategy customization is not recognized as a public
>>>>>>>>>> interface as it is not explicitly documented.
>>>>>>>>>> As it is not used according to the feedback from this survey, I'll
>>>>>>>>>> conclude that we do not need to support customized RestartStrategy
>>>>>>>>>> for the new scheduler in Flink 1.10.
>>>>>>>>>>
>>>>>>>>>> Other usages are still supported, including all the strategies
>>>>>>>>>> and ways of configuring them described in
>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>>
>>>>>>>>>> Feel free to share in this thread if you have any concerns about it.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Zhu Zhu
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Oytun for the reply!
>>>>>>>>>>>
>>>>>>>>>>> Sorry for not having stated it clearly. When saying "customized
>>>>>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>>>>>> by themselves and use it by configuring like
>>>>>>>>>>> "restart-strategy: org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>>>>>
>>>>>>>>>>> The usage of restart strategies you mentioned will keep working
>>>>>>>>>>> with the new scheduler.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Zhu Zhu
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Zhu,
>>>>>>>>>>>>
>>>>>>>>>>>> We are using a custom restart strategy like this:
>>>>>>>>>>>>
>>>>>>>>>>>> environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1), Time.minutes(10)));
>>>>>>>>>>>>
>>>>>>>>>>>> ---
>>>>>>>>>>>> Oytun Tez
>>>>>>>>>>>>
>>>>>>>>>>>> *M O T A W O R D*
>>>>>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>>>>>> oy...@motaword.com — www.motaword.com
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I wanted to reach out to you and ask how many of you are using
>>>>>>>>>>>>> a customized RestartStrategy[1] in production jobs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are currently developing the new Flink scheduler[2] which
>>>>>>>>>>>>> interacts with restart strategies in a different way. We have to
>>>>>>>>>>>>> re-design the interfaces for the new restart strategies (so called
>>>>>>>>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy
>>>>>>>>>>>>> will not work any more with the new scheduler.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We want to know whether we should keep a way to customize
>>>>>>>>>>>>> RestartBackoffTimeStrategy so that existing customized
>>>>>>>>>>>>> RestartStrategy can be migrated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd appreciate it if you can share the status if you are
>>>>>>>>>>>>> using customized RestartStrategy. That will be valuable for us
>>>>>>>>>>>>> to make decisions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Zhu Zhu
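
For readers following the Gauge versus Counter/Meter point in this thread, here is a minimal sketch of why a threshold on a cumulative value keeps firing while a threshold on the increase over a rolling window does not. It is plain Java and does not use any Flink or metrics-backend API. The class name, method names, sample values, and window size are all hypothetical and chosen only for illustration.

```java
/** Hypothetical illustration only; not Flink's metric API or any alerting backend. */
class RestartAlertSketch {

    /** Alert on the raw cumulative value, as when a "fullRestarts"-style gauge is compared directly. */
    static boolean alertOnCumulative(long[] samples, int index, long threshold) {
        return samples[index] > threshold;
    }

    /** Alert on the increase over the last `window` samples (a discrete derivative of the gauge). */
    static boolean alertOnIncrease(long[] samples, int index, int window, long threshold) {
        int start = Math.max(0, index - window);
        return samples[index] - samples[start] > threshold;
    }

    public static void main(String[] args) {
        // Assumed data: cumulative restart count sampled once per minute.
        // Two restarts happen early, then the job stays healthy.
        long[] fullRestarts = {0, 1, 2, 2, 2, 2, 2, 2};
        for (int i = 0; i < fullRestarts.length; i++) {
            System.out.printf("t=%d cumulative=%b increase=%b%n",
                    i,
                    alertOnCumulative(fullRestarts, i, 1),
                    alertOnIncrease(fullRestarts, i, 3, 1));
        }
    }
}
```

With these assumed samples, the cumulative check stays true from the second restart onward, while the windowed check clears once the restarts fall out of the window. This is the behavior Steven describes wanting from a counter or a derivative.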