Hi Steven,

To conclude: since we will have a meter metric[1] for restarts, a customized restart strategy is not needed in your case. Is that right?

[1] https://issues.apache.org/jira/browse/FLINK-14164

Thanks,
Zhu Zhu

On Wed, Sep 25, 2019 at 2:30 AM Steven Wu <stevenz...@gmail.com> wrote:

> Zhu Zhu,
>
> Sorry, I was using different terminology. Yes, the Flink Meter is what I
> was talking about regarding "fullRestarts" for threshold-based alerting.
>
> On Mon, Sep 23, 2019 at 7:46 PM Zhu Zhu <reed...@gmail.com> wrote:
>
>> Steven,
>>
>> In my mind, a Flink Counter only stores its accumulated count and
>> reports that value. Are you using an external counter directly?
>> Maybe Flink Meter/MeterView is what you need? It stores the count and
>> calculates the rate, and it reports both its "count" and its "rate" to
>> external metric services.
>>
>> The counter "task_failures" only works if the individual failover
>> strategy is enabled. However, it is not a public interface and its use
>> is not recommended, as fine-grained recovery (region failover) now
>> supersedes it.
>> I've opened a ticket[1] to add a metric that shows failovers and
>> respects fine-grained recovery.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-14164
>>
>> Thanks,
>> Zhu Zhu
>>
>> On Tue, Sep 24, 2019 at 6:41 AM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> When we set up an alert like "fullRestarts > 1" for some rolling
>>> window, we want to use a counter. If it is a Gauge, "fullRestarts"
>>> will never go below 1 after a first full restart, so the alert
>>> condition will always be true after the first job restart. If we can
>>> apply a derivative to the Gauge value, I guess the alert can probably
>>> work. I can explore whether that is an option.
>>>
>>> Yeah, understood that "fullRestarts" won't increment when fine-grained
>>> recovery happens. I think the "task_failures" counter already exists
>>> in Flink.
>>>
>>> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>
>>>> Steven,
>>>>
>>>> Thanks for the information. If we can determine this is a common
>>>> issue, we can solve it in Flink core.
>>>> To get to that state, I have two questions that need your help:
>>>> 1. Why is a gauge not good for alerting? The metric "fullRestarts" is
>>>> a Gauge<Long>. Does the metric reporter you use report Counter and
>>>> Gauge<Long> to external services in different ways? Or can anything
>>>> else differ due to the metric type?
>>>> 2. Is the "number of restarts" what you actually need, rather than
>>>> the "fullRestarts" count? If so, I believe we will have such a counter
>>>> metric in 1.10, since the existing "fullRestarts" metric value is not
>>>> the number of restarts when fine-grained recovery (a feature added in
>>>> 1.9.0) is enabled.
>>>> "fullRestarts" reveals how many times the entire job graph has been
>>>> restarted. If fine-grained recovery is enabled, the graph is not
>>>> restarted when task failures happen, and the "fullRestarts" value
>>>> does not increment in such cases.
>>>>
>>>> I'd appreciate it if you could help with these questions so that we
>>>> can make better decisions for Flink.
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>
>>>>> Zhu Zhu,
>>>>>
>>>>> The Flink fullRestarts metric is a Gauge, which is not good for
>>>>> alerting on. We publish an equivalent Counter metric for alerting
>>>>> purposes.
>>>>>
>>>>> Thanks,
>>>>> Steven
>>>>>
>>>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>
>>>>>> Thanks, Steven, for the feedback!
>>>>>> Could you share more information about the metrics you add in your
>>>>>> customized restart strategy?
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>
>>>>>>> We do use config like "restart-strategy:
>>>>>>> org.foobar.MyRestartStrategyFactoryFactory", mainly to add
>>>>>>> metrics beyond the ones Flink provides.
>>>>>>>
>>>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks everyone for the input.
>>>>>>>>
>>>>>>>> The RestartStrategy customization is not recognized as a public
>>>>>>>> interface, as it is not explicitly documented.
>>>>>>>> Since the feedback from this survey indicates it is not used,
>>>>>>>> I'll conclude that we do not need to support customized
>>>>>>>> RestartStrategy for the new scheduler in Flink 1.10.
>>>>>>>>
>>>>>>>> Other usages are still supported, including all the strategies
>>>>>>>> and the ways of configuring them described in
>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>
>>>>>>>> Feel free to share in this thread if you have any concerns about
>>>>>>>> it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Zhu
>>>>>>>>
>>>>>>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Oytun for the reply!
>>>>>>>>>
>>>>>>>>> Sorry for not having stated it clearly. By "customized
>>>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>>>> themselves and use it through configuration like
>>>>>>>>> "restart-strategy: org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>>>
>>>>>>>>> The usage of restart strategies you mentioned will keep working
>>>>>>>>> with the new scheduler.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Zhu Zhu
>>>>>>>>>
>>>>>>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Zhu,
>>>>>>>>>>
>>>>>>>>>> We are using a custom restart strategy like this:
>>>>>>>>>>
>>>>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>> Oytun Tez
>>>>>>>>>>
>>>>>>>>>> *M O T A W O R D*
>>>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>>>> oy...@motaword.com — www.motaword.com
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> I wanted to reach out to you and ask how many of you are
>>>>>>>>>>> using a customized RestartStrategy[1] in production jobs.
>>>>>>>>>>>
>>>>>>>>>>> We are currently developing the new Flink scheduler[2], which
>>>>>>>>>>> interacts with restart strategies in a different way. We have
>>>>>>>>>>> to redesign the interfaces for the new restart strategies
>>>>>>>>>>> (the so-called RestartBackoffTimeStrategy). Existing
>>>>>>>>>>> customized RestartStrategy implementations will no longer
>>>>>>>>>>> work with the new scheduler.
>>>>>>>>>>>
>>>>>>>>>>> We want to know whether we should keep a way to customize
>>>>>>>>>>> RestartBackoffTimeStrategy so that existing customized
>>>>>>>>>>> RestartStrategy implementations can be migrated.
>>>>>>>>>>>
>>>>>>>>>>> I'd appreciate it if you could share your status if you are
>>>>>>>>>>> using a customized RestartStrategy. That will be valuable for
>>>>>>>>>>> us in making decisions.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Zhu Zhu
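
For reference, the Gauge-versus-Meter point discussed in this thread can be illustrated with a minimal sketch. It is plain Java and deliberately does not use Flink's metric classes: RestartAlertSketch, recordRestart, cumulative and restartsInWindow are hypothetical names introduced only for illustration, and the one-minute window is an assumption. It shows why a threshold on a cumulative value stays true after restarts stop, while a count over a rolling window drops back to zero.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration only; this is not Flink's Gauge, Counter, or MeterView. */
final class RestartAlertSketch {
    // Cumulative number of restarts, analogous to a gauge that reports a running total.
    private long totalRestarts = 0;
    // Timestamps (in milliseconds) of restarts that may still fall inside the window.
    private final Deque<Long> restartTimesMs = new ArrayDeque<>();

    void recordRestart(long nowMs) {
        totalRestarts++;
        restartTimesMs.addLast(nowMs);
    }

    /** The running total never decreases, so "total > 1" stays true forever. */
    long cumulative() {
        return totalRestarts;
    }

    /**
     * Number of restarts in the interval (nowMs - windowMs, nowMs].
     * Evicts old timestamps, so callers must pass non-decreasing nowMs values.
     */
    long restartsInWindow(long nowMs, long windowMs) {
        while (!restartTimesMs.isEmpty() && restartTimesMs.peekFirst() <= nowMs - windowMs) {
            restartTimesMs.removeFirst();
        }
        return restartTimesMs.size();
    }

    public static void main(String[] args) {
        RestartAlertSketch sketch = new RestartAlertSketch();
        long windowMs = 60_000L; // assumed one-minute rolling window

        sketch.recordRestart(0L);
        sketch.recordRestart(1_000L);

        // Shortly after the restarts, both alert conditions fire.
        System.out.println("cumulative > 1: " + (sketch.cumulative() > 1));
        System.out.println("windowed > 1:   " + (sketch.restartsInWindow(2_000L, windowMs) > 1));

        // An hour later, only the cumulative condition is still true.
        System.out.println("cumulative > 1: " + (sketch.cumulative() > 1));
        System.out.println("windowed > 1:   " + (sketch.restartsInWindow(3_600_000L, windowMs) > 1));
    }
}
```

A metric backend that can take a derivative or a windowed delta of the reported value would achieve the same effect on the alerting side, which is the alternative Steven mentions above.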