1s sounds like a good tradeoff to me.

On Mon, Sep 2, 2019 at 1:30 PM Till Rohrmann <trohrm...@apache.org> wrote:
> Thanks a lot for all your feedback. I see there is a slight tendency
> towards having a non-zero default delay so far.
>
> However, Yu has brought up some valid points. Maybe I can shed some light
> on a).
>
> Before FLINK-9158 we set the default delay to 10s because Flink did not
> support queued scheduling, which meant that if one slot was missing or
> still occupied, Flink would fail right away with a
> NoResourceAvailableException. To prevent this, we added the delay. This
> also covered the case where the job was failing because of an overloaded
> external system.
>
> When we finished FLIP-6, we thought that we could improve the user
> experience by decreasing the default delay to 0s, because all
> Flink-related problems (slot still occupied, slot missing because of a
> reconnecting TM) could be handled by the default slot request timeout,
> which allows the slots to become ready after scheduling has been kicked
> off. However, we did not properly take the case of overloaded external
> systems into account.
>
> For b) I agree that any default value should be properly documented. This
> was clearly an oversight when FLINK-9158 was merged. Moreover, I believe
> that there won't be a one-size-fits-all default value. There are always
> cases where one needs to adapt it to one's needs. But this is ok. The
> goal should be to find the default value which works for most cases.
>
> So maybe the middle ground between 10s and 0s could be a solution. Setting
> the default restart delay to 1s should prevent restart storms caused by
> overloaded external systems and still be fast enough to not slow down
> recoveries noticeably in most cases. If one needs a super fast recovery,
> then one should set the delay value to 0s. If one requires a longer delay
> because of a particular infrastructure, then one needs to change the value
> too. What do you think?
>
> Cheers,
> Till
>
> On Sun, Sep 1, 2019 at 11:56 PM Yu Li <car...@gmail.com> wrote:
>
>> -1 on increasing the default delay to non-zero, for the reasons below:
>>
>> a) I can see some concerns about setting the delay to zero in the very
>> original JIRA (FLINK-2993
>> <https://issues.apache.org/jira/browse/FLINK-2993>), but later on in
>> FLINK-9158 <https://issues.apache.org/jira/browse/FLINK-9158> we still
>> decided to make the change, so I'm wondering whether that decision also
>> came from a customer requirement? If so, how can we judge whether one
>> requirement overrides the other?
>>
>> b) There could be valid reasons for both default values depending on the
>> use case, as well as corresponding workarounds (for example, with the
>> latest policy, setting the config manually to 10s could resolve the
>> problem mentioned), and from earlier replies to this thread we can see
>> that users have already taken action. Changing it back to non-zero again
>> won't affect such users but might surprise those depending on 0 as the
>> default.
>>
>> Last but not least, no matter what decision we make this time, I'd
>> suggest making it final and documenting it explicitly in our release
>> notes. Checking the 1.5.0 release notes [1] [2], it seems we didn't
>> mention the change to the default restart delay, and we'd better learn
>> from that this time. Thanks.
>>
>> [1]
>> https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html
>>
>> Best Regards,
>> Yu
>>
>>
>> On Sun, 1 Sep 2019 at 04:33, Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> +1 on what Zhu Zhu said.
>>>
>>> We also override the default to 10 s.
>>>
>>> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>
>>>> In our production, we usually override the restart delay to be 10 s.
>>>> We once encountered cases where external services were overwhelmed by
>>>> reconnections from frequently restarted tasks.
>>>> As a safer, though not optimal, option, a default delay larger than 0 s
>>>> is better in my opinion.
>>>>
>>>>
>>>> 未来阳光 <2217232...@qq.com> wrote on Fri, Aug 30, 2019 at 10:23 PM:
>>>>
>>>>> Hi,
>>>>>
>>>>> I think it's better to increase the default value. +1
>>>>>
>>>>> Best.
>>>>>
>>>>>
>>>>> ------------------ Original Message ------------------
>>>>> From: "Till Rohrmann"<trohrm...@apache.org>;
>>>>> Sent: Friday, Aug 30, 2019, 10:07 PM
>>>>> To: "dev"<dev@flink.apache.org>; "user"<u...@flink.apache.org>;
>>>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>>>
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I wanted to reach out to you and ask whether decreasing the default
>>>>> delay to `0 s` for the fixed delay restart strategy [1] is causing
>>>>> trouble. A user reported that he would like to increase the default
>>>>> value because it can cause restart storms in case of systematic
>>>>> faults [2].
>>>>>
>>>>> The downside of increasing the default delay would be a slightly
>>>>> increased restart time if this config option is not explicitly set.
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>
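
To put rough numbers on the restart-storm argument made in this thread, here is a minimal sketch in Python. It is not Flink code and does not model Flink's scheduler: the function name `restart_attempts`, the 60 s outage, and the 0.5 s per-restart overhead are all hypothetical values chosen for illustration, not figures from the thread. It only counts how often a job that fails immediately on reconnect would hit an unavailable external system for the three delays discussed (0 s, 1 s, 10 s).

```python
import math


def restart_attempts(outage_s: float, restart_delay_s: float,
                     restart_overhead_s: float = 0.5) -> int:
    """Count reconnect attempts against an external system during an outage.

    Simplified model (an assumption, not Flink's actual behaviour): every
    attempt takes `restart_overhead_s` seconds to redeploy and fail, after
    which the job waits `restart_delay_s` seconds before trying again.
    """
    if outage_s <= 0:
        return 0
    cycle_s = restart_overhead_s + restart_delay_s
    if cycle_s <= 0:
        raise ValueError("restart overhead plus delay must be positive")
    # Attempts start at t = 0, cycle_s, 2 * cycle_s, ... while t < outage_s.
    return math.ceil(outage_s / cycle_s)


if __name__ == "__main__":
    # Compare the three delays discussed in the thread for a 60 s outage.
    for delay in (0.0, 1.0, 10.0):
        attempts = restart_attempts(60.0, delay)
        print(f"delay={delay:>4}s -> {attempts} attempts in a 60 s outage")
```

Under these assumed numbers, a 1 s delay cuts the attempts to a third of the 0 s case while adding only one second to each recovery, which is the tradeoff proposed above. In a real deployment the delay is set through the `restart-strategy.fixed-delay.delay` configuration option rather than anything shown in this sketch.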