Thanks a lot for all your feedback. So far I see a slight tendency
towards having a non-zero default delay.

However, Yu has brought up some valid points. Maybe I can shed some light
on a).

Before FLINK-9158 we set the default delay to 10s because Flink did not
support queued scheduling which meant that if one slot was missing/still
being occupied, then Flink would fail right away with
a NoResourceAvailableException. In order to prevent this we added the
delay. This also covered the case when the job was failing because of an
overloaded external system.

When we finished FLIP-6, we thought that we could improve the user
experience by decreasing the default delay to 0s because all Flink-related
problems (slot still occupied, slot missing because of reconnecting TM)
could be handled by the default slot request timeout, which allowed the
slots to become ready after the scheduling was kicked off. However, we did
not properly take the case of overloaded external systems into account.

For b) I agree that any default value should be properly documented. This
was clearly an oversight when FLINK-9158 was merged. Moreover, I
believe that there won't be a solve-it-all default value. There are
always cases where one needs to adapt it to one's needs. But this is ok. The
goal should be to find the default value which works for most cases.

So maybe the middle ground between 10s and 0s could be a solution. Setting
the default restart delay to 1s should prevent restart storms caused by
overloaded external systems and still be fast enough to not slow down
recoveries noticeably in most cases. If one needs a super fast recovery,
then one should set the delay value to 0s. If one requires a longer delay
because of a particular infrastructure, then one needs to change the value
too. What do you think?
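
To make the restart-storm argument more concrete, here is a rough
back-of-the-envelope sketch in Java. It is not Flink code: the class
`RestartStormSketch`, the method `attemptsWithin`, and the assumed 200 ms
that a job runs before it fails again are all hypothetical and only serve
as an illustration of how the delay bounds the number of reconnection
attempts an overloaded external system would see.

```java
import java.time.Duration;

public class RestartStormSketch {

    /**
     * Hypothetical helper: counts how many complete restart cycles fit into
     * the given window, assuming every attempt fails after attemptDuration
     * and the next attempt starts restartDelay later.
     */
    static long attemptsWithin(Duration window, Duration attemptDuration, Duration restartDelay) {
        long cycleMillis = attemptDuration.toMillis() + restartDelay.toMillis();
        if (cycleMillis <= 0) {
            // With no delay and an instantly failing job the count is unbounded.
            throw new IllegalArgumentException("attempt duration plus delay must be positive");
        }
        return window.toMillis() / cycleMillis;
    }

    public static void main(String[] args) {
        Duration window = Duration.ofMinutes(1);
        // Assumption for illustration only: the job fails 200 ms after each restart.
        Duration attemptDuration = Duration.ofMillis(200);

        for (long delaySeconds : new long[] {0, 1, 10}) {
            long attempts = attemptsWithin(window, attemptDuration, Duration.ofSeconds(delaySeconds));
            System.out.println("delay " + delaySeconds + "s: " + attempts + " restarts per minute");
        }
    }
}
```

Under these assumptions, a delay of 0s allows 300 restarts per minute,
1s allows 50, and 10s allows 5. The real numbers depend on how quickly
the job fails after a restart.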

Cheers,
Till

On Sun, Sep 1, 2019 at 11:56 PM Yu Li <car...@gmail.com> wrote:

> -1 on increasing the default delay to non-zero, for the reasons below:
>
> a) I could see some concerns about setting the delay to zero in the very
> original JIRA (FLINK-2993
> <https://issues.apache.org/jira/browse/FLINK-2993>) but later on in
> FLINK-9158 <https://issues.apache.org/jira/browse/FLINK-9158> we still
> decided to make the change, so I'm wondering whether that decision also came
> from a customer requirement? If so, how could we judge whether one
> requirement overrides the other?
>
> b) There could be valid reasons for both default values depending on
> different use cases, as well as corresponding workarounds (for example, under
> the latest policy, setting the config manually to 10s could resolve the
> problem mentioned), and from former replies to this thread we can see that
> users have already taken action. Changing it back to non-zero again won't
> affect such users but might surprise those depending on 0 as the default.
>
> Last but not least, no matter what decision we make this time, I'd suggest
> making it final and documenting it explicitly in our release notes. Checking
> the 1.5.0 release notes [1] [2], it seems we didn't mention the change to the
> default restart delay, and we'd better learn from that this time. Thanks.
>
> [1]
> https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
> [2]
> https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html
>
> Best Regards,
> Yu
>
>
> On Sun, 1 Sep 2019 at 04:33, Steven Wu <stevenz...@gmail.com> wrote:
>
>> +1 on what Zhu Zhu said.
>>
>> We also override the default to 10 s.
>>
>> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <reed...@gmail.com> wrote:
>>
>>> In our production, we usually override the restart delay to be 10 s.
>>> We once encountered cases where external services were overwhelmed by
>>> reconnections from frequently restarted tasks.
>>> As a safer, though not optimal, option, a default delay larger than 0 s
>>> is better in my opinion.
>>>
>>>
>>> On Fri, Aug 30, 2019 at 10:23 PM 未来阳光 <2217232...@qq.com> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> I think it's better to increase the default value. +1
>>>>
>>>>
>>>> Best.
>>>>
>>>>
>>>>
>>>>
>>>> ------------------ Original Message ------------------
>>>> From: "Till Rohrmann"<trohrm...@apache.org>;
>>>> Sent: Friday, August 30, 2019, 10:07 PM
>>>> To: "dev"<d...@flink.apache.org>; "user"<user@flink.apache.org>;
>>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>>
>>>>
>>>>
>>>> Hi everyone,
>>>>
>>>> I wanted to reach out to you and ask whether decreasing the default delay
>>>> to `0 s` for the fixed delay restart strategy [1] is causing trouble. A
>>>> user reported that he would like to increase the default value because it
>>>> can cause restart storms in case of systematic faults [2].
>>>>
>>>> The downside of increasing the default delay would be a slightly increased
>>>> restart time if this config option is not explicitly set.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>>
>>>> Cheers,
>>>> Till
>>>
>>>
