Re: [SURVEY] How many people are using customized RestartStrategy(s)

Zhu Zhu Tue, 24 Sep 2019 20:05:20 -0700

Hi Steven,

To conclude: since we will have a meter metric [1] for restarts, a
customized restart strategy is not needed in your case.
Is that right?


[1] https://issues.apache.org/jira/browse/FLINK-14164

Thanks,
Zhu Zhu

Steven Wu <stevenz...@gmail.com> wrote on Wed, Sep 25, 2019 at 2:30 AM:

> Zhu Zhu,
>
> Sorry, I was using different terminology. Yes, the Flink Meter is what I
> was talking about regarding "fullRestarts" for threshold-based alerting.
>
> On Mon, Sep 23, 2019 at 7:46 PM Zhu Zhu <reed...@gmail.com> wrote:
>
>> Steven,
>>
>> As I understand it, a Flink Counter only stores its accumulated count and
>> reports that value. Are you using an external counter directly?
>> Maybe Flink's Meter/MeterView is what you need? It stores the count and
>> calculates the rate, and it reports both its "count" and its "rate" to
>> external metric services.
>>
>> The counter "task_failures" only works if the individual failover
>> strategy is enabled. However, it is not a public interface and its use is
>> not recommended, as fine-grained recovery (region failover) now
>> supersedes it.
>> I've opened a ticket [1] to add a metric that shows failovers and respects
>> fine-grained recovery.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-14164
>>
>> Thanks,
>> Zhu Zhu
>>
>> Steven Wu <stevenz...@gmail.com> wrote on Tue, Sep 24, 2019 at 6:41 AM:
>>
>>>
>>> When we set up an alert like "fullRestarts > 1" over some rolling window,
>>> we want to use a counter. If it is a Gauge, "fullRestarts" will never go
>>> below 1 after the first full restart, so the alert condition will always
>>> be true after the first job restart. If we can apply a derivative to the
>>> Gauge value, I guess the alert can probably work. I can explore whether
>>> that is an option.
>>>
>>> Yeah, understood that "fullRestarts" won't increment when fine-grained
>>> recovery happens. I think the "task_failures" counter already exists in
>>> Flink.
>>>
>>>
>>>
>>> On Sun, Sep 22, 2019 at 7:59 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>
>>>> Steven,
>>>>
>>>> Thanks for the information. If we can determine that this is a common
>>>> issue, we can solve it in Flink core.
>>>> To get there, I have two questions that need your help:
>>>> 1. Why is a gauge not good for alerting? The metric "fullRestarts" is a
>>>> Gauge<Long>. Does the metric reporter you use report Counter and
>>>> Gauge<Long> to external services in different ways? Or can anything
>>>> else differ because of the metric type?
>>>> 2. Is the "number of restarts" what you actually need, rather than
>>>> the "fullRestarts" count? If so, I believe we will have such a counter
>>>> metric in 1.10, since the existing "fullRestarts" value is not the
>>>> number of restarts when fine-grained recovery (a feature added in 1.9.0)
>>>> is enabled.
>>>>     "fullRestarts" reveals how many times the entire job graph has been
>>>> restarted. If fine-grained recovery is enabled, the graph is not
>>>> restarted when task failures happen, and the "fullRestarts" value does
>>>> not increment in such cases.
>>>>
>>>> I'd appreciate it if you could help with these questions so that we can
>>>> make better decisions for Flink.
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> Steven Wu <stevenz...@gmail.com> wrote on Sun, Sep 22, 2019 at 3:31 AM:
>>>>
>>>>> Zhu Zhu,
>>>>>
>>>>> Flink's fullRestarts metric is a Gauge, which is not good for
>>>>> alerting. We publish an equivalent Counter metric for alerting
>>>>> purposes.
>>>>>
>>>>> Thanks,
>>>>> Steven
>>>>>
>>>>> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Steven for the feedback!
>>>>>> Could you share more information about the metrics you add in your
>>>>>> customized restart strategy?
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> Steven Wu <stevenz...@gmail.com> wrote on Fri, Sep 20, 2019 at 7:11 AM:
>>>>>>
>>>>>>> We do use a config like "restart-strategy:
>>>>>>> org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
>>>>>>> beyond the ones Flink provides.
>>>>>>>
>>>>>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks everyone for the input.
>>>>>>>>
>>>>>>>> RestartStrategy customization is not recognized as a public
>>>>>>>> interface, as it is not explicitly documented.
>>>>>>>> Since the feedback in this survey indicates it is not used, I'll
>>>>>>>> conclude that we do not need to support customized RestartStrategy
>>>>>>>> for the new scheduler in Flink 1.10.
>>>>>>>>
>>>>>>>> Other usages are still supported, including all the strategies and
>>>>>>>> the ways of configuring them described in
>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>>>>>> .
>>>>>>>>
>>>>>>>> Feel free to share in this thread if you have any concerns about it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zhu Zhu
>>>>>>>>
>>>>>>>> Zhu Zhu <reed...@gmail.com> wrote on Thu, Sep 12, 2019 at 10:33 PM:
>>>>>>>>
>>>>>>>>> Thanks Oytun for the reply!
>>>>>>>>>
>>>>>>>>> Sorry for not having stated it clearly. By "customized
>>>>>>>>> RestartStrategy", we mean that users implement an
>>>>>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>>>>>> themselves and use it with a configuration like "restart-strategy:
>>>>>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>>>>>
>>>>>>>>> The usage of restart strategies you mentioned will keep working
>>>>>>>>> with the new scheduler.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Zhu Zhu
>>>>>>>>>
>>>>>>>>> Oytun Tez <oy...@motaword.com> wrote on Thu, Sep 12, 2019 at 10:05 PM:
>>>>>>>>>
>>>>>>>>>> Hi Zhu,
>>>>>>>>>>
>>>>>>>>>> We are using a custom restart strategy like this:
>>>>>>>>>>
>>>>>>>>>> environment.setRestartStrategy(failureRateRestart(2,
>>>>>>>>>> Time.minutes(1), Time.minutes(10)));
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>> Oytun Tez
>>>>>>>>>>
>>>>>>>>>> *M O T A W O R D*
>>>>>>>>>> The World's Fastest Human Translation Platform.
>>>>>>>>>> oy...@motaword.com — www.motaword.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <reed...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>>>>>>
>>>>>>>>>>> We are currently developing the new Flink scheduler[2], which
>>>>>>>>>>> interacts with restart strategies in a different way. We have to
>>>>>>>>>>> redesign the interfaces for the new restart strategies (the
>>>>>>>>>>> so-called RestartBackoffTimeStrategy). Existing customized
>>>>>>>>>>> RestartStrategy implementations will no longer work with the new
>>>>>>>>>>> scheduler.
>>>>>>>>>>>
>>>>>>>>>>> We want to know whether we should keep a way to customize
>>>>>>>>>>> RestartBackoffTimeStrategy so that existing customized
>>>>>>>>>>> RestartStrategy implementations can be migrated.
>>>>>>>>>>>
>>>>>>>>>>> I'd appreciate it if you could share your status if you are
>>>>>>>>>>> using a customized RestartStrategy. That will be valuable for us
>>>>>>>>>>> in making decisions.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Zhu Zhu
>>>>>>>>>>>
>>>>>>>>>>
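
The gauge-versus-counter alerting point raised earlier in this thread (a
cumulative "fullRestarts" value never drops, so "fullRestarts > 1" stays true
forever, while a derivative over a rolling window does not) can be illustrated
with a minimal sketch. It uses plain Java and no Flink APIs. The class name
RestartAlertSketch, its method, and the window and threshold values are all
hypothetical and chosen only for illustration; this is not how Flink or any
particular metrics backend implements it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not a Flink class: decides whether to alert based on
// how much a cumulative restart count grew inside a rolling time window.
// Assumes a positive window and non-decreasing timestamps.
class RestartAlertSketch {
    private final long windowMillis;
    private final long threshold;
    // Each entry is {timestampMillis, cumulativeRestarts}, oldest first.
    private final Deque<long[]> samples = new ArrayDeque<>();

    RestartAlertSketch(long windowMillis, long threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    // Records one reading of the cumulative value and returns true when the
    // increase within the window is greater than the threshold.
    boolean record(long timestampMillis, long cumulativeRestarts) {
        samples.addLast(new long[] {timestampMillis, cumulativeRestarts});
        // Discard readings that fell out of the window; the newest one stays.
        while (samples.size() > 1
                && samples.peekFirst()[0] < timestampMillis - windowMillis) {
            samples.removeFirst();
        }
        // The oldest reading still in the window serves as the baseline.
        long increase = cumulativeRestarts - samples.peekFirst()[1];
        return increase > threshold;
    }
}
```

With a 60-second window and a threshold of 1, the readings (0 s, 0),
(10 s, 2), (100 s, 2) give false, true, false. A check on the raw cumulative
value ("value > 1") would still be true at 100 s even though no new restart
happened in the last window.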
