Steven,

Thanks for the information. If we can determine that this is a common issue,
we can solve it in Flink core.
To get there, I have two questions that I need your help with:
1. Why is a gauge not good for alerting? The metric "fullRestart" is a
Gauge<Long>. Does the metric reporter you use report Counter and
Gauge<Long> to external services in different ways? Or does anything else
differ depending on the metric type?
2. Is the "number of restarts" what you actually need, rather than
the "fullRestart" count? If so, I believe we will have such a counter
metric in 1.10, since the existing "fullRestart" metric value is not the
number of restarts when fine-grained recovery (a feature added in 1.9.0) is
enabled.
    "fullRestart" reveals how many times the entire job graph has been
restarted. If fine-grained recovery is enabled, the graph is not restarted
when task failures happen, and the "fullRestart" value does not increment
in such cases.
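
To make the distinction in question 2 concrete, here is a minimal toy sketch.
It is not Flink code: the class RestartMetricsSketch, its methods, and the
"numRestarts" name are hypothetical, and it assumes the simplified behavior
described above (with fine-grained recovery enabled, a task failure never
restarts the whole graph).

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model only; names are hypothetical and not part of any Flink API. */
public class RestartMetricsSketch {
    // Incremented only when the entire job graph is restarted.
    private final AtomicLong fullRestarts = new AtomicLong();
    // Incremented on every restart, whether full or regional.
    private final AtomicLong numRestarts = new AtomicLong();

    private final boolean fineGrainedRecovery;

    public RestartMetricsSketch(boolean fineGrainedRecovery) {
        this.fineGrainedRecovery = fineGrainedRecovery;
    }

    /** Records one task failure and the restart it triggers. */
    public void onTaskFailure() {
        numRestarts.incrementAndGet();
        if (!fineGrainedRecovery) {
            // Assumption: without fine-grained recovery, every task
            // failure restarts the whole job graph.
            fullRestarts.incrementAndGet();
        }
    }

    public long getFullRestarts() {
        return fullRestarts.get();
    }

    public long getNumRestarts() {
        return numRestarts.get();
    }

    public static void main(String[] args) {
        RestartMetricsSketch metrics = new RestartMetricsSketch(true);
        metrics.onTaskFailure();
        metrics.onTaskFailure();
        metrics.onTaskFailure();
        // Prints "fullRestarts=0, numRestarts=3"
        System.out.println("fullRestarts=" + metrics.getFullRestarts()
                + ", numRestarts=" + metrics.getNumRestarts());
    }
}
```

In this model, an alert that watches only "fullRestarts" would stay silent
through three task failures, which is why the two values need to be told apart.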

I'd appreciate it if you could help with these questions so that we can make
better decisions for Flink.

Thanks,
Zhu Zhu

On Sun, Sep 22, 2019 at 3:31 AM Steven Wu <stevenz...@gmail.com> wrote:

> Zhu Zhu,
>
> Flink's fullRestart metric is a Gauge, which is not good for alerting on. We
> publish an equivalent Counter metric for alerting purposes.
>
> Thanks,
> Steven
>
> On Thu, Sep 19, 2019 at 7:45 PM Zhu Zhu <reed...@gmail.com> wrote:
>
>> Thanks Steven for the feedback!
>> Could you share more information about the metrics you add in your
>> customized restart strategy?
>>
>> Thanks,
>> Zhu Zhu
>>
>> On Fri, Sep 20, 2019 at 7:11 AM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> We do use a config like "restart-strategy:
>>> org.foobar.MyRestartStrategyFactoryFactory", mainly to add metrics
>>> beyond the ones Flink provides.
>>>
>>> On Thu, Sep 19, 2019 at 4:50 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>
>>>> Thanks everyone for the input.
>>>>
>>>> RestartStrategy customization is not recognized as a public
>>>> interface, as it is not explicitly documented.
>>>> Since the feedback to this survey indicates it is not in use, I'll
>>>> conclude that we do not need to support customized RestartStrategy for
>>>> the new scheduler in Flink 1.10.
>>>>
>>>> Other usages are still supported, including all the strategies and
>>>> the ways of configuring them described in
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-strategies
>>>> .
>>>>
>>>> Feel free to share in this thread if you have any concerns about this.
>>>>
>>>> Thanks,
>>>> Zhu Zhu
>>>>
>>>> On Thu, Sep 12, 2019 at 10:33 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>
>>>>> Thanks Oytun for the reply!
>>>>>
>>>>> Sorry for not having stated it clearly. By "customized
>>>>> RestartStrategy", we mean that users implement an
>>>>> *org.apache.flink.runtime.executiongraph.restart.RestartStrategy*
>>>>> themselves and enable it with a configuration like "restart-strategy:
>>>>> org.foobar.MyRestartStrategyFactoryFactory".
>>>>>
>>>>> The usage of restart strategies you mentioned will keep working with
>>>>> the new scheduler.
>>>>>
>>>>> Thanks,
>>>>> Zhu Zhu
>>>>>
>>>>> On Thu, Sep 12, 2019 at 10:05 PM Oytun Tez <oy...@motaword.com> wrote:
>>>>>
>>>>>> Hi Zhu,
>>>>>>
>>>>>> We are using a custom restart strategy like this:
>>>>>>
>>>>>> environment.setRestartStrategy(failureRateRestart(2, Time.minutes(1),
>>>>>> Time.minutes(10)));
>>>>>>
>>>>>>
>>>>>> ---
>>>>>> Oytun Tez
>>>>>>
>>>>>> *M O T A W O R D*
>>>>>> The World's Fastest Human Translation Platform.
>>>>>> oy...@motaword.com — www.motaword.com
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 12, 2019 at 7:11 AM Zhu Zhu <reed...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I wanted to reach out to you and ask how many of you are using a
>>>>>>> customized RestartStrategy[1] in production jobs.
>>>>>>>
>>>>>>> We are currently developing the new Flink scheduler[2], which
>>>>>>> interacts with restart strategies in a different way. We have to
>>>>>>> redesign the interfaces for the new restart strategies (the so-called
>>>>>>> RestartBackoffTimeStrategy). Existing customized RestartStrategy
>>>>>>> implementations will no longer work with the new scheduler.
>>>>>>>
>>>>>>> We want to know whether we should keep a way
>>>>>>> to customize RestartBackoffTimeStrategy so that existing customized
>>>>>>> RestartStrategy implementations can be migrated.
>>>>>>>
>>>>>>> I'd appreciate it if you could share your status if you are
>>>>>>> using a customized RestartStrategy. That will be valuable for us in
>>>>>>> making decisions.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#restart-strategies
>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-10429
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zhu Zhu
>>>>>>>
>>>>>>
