1s looks good to me. I also think the conclusion about when a user should override the delay is worth documenting.

Thanks,
Zhu Zhu

On Tue, Sep 3, 2019 at 4:42 AM Steven Wu <stevenz...@gmail.com> wrote:

> 1s sounds like a good tradeoff to me.
>
> On Mon, Sep 2, 2019 at 1:30 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Thanks a lot for all your feedback. I see there is a slight tendency
>> towards having a non-zero default delay so far.
>>
>> However, Yu has brought up some valid points. Maybe I can shed some
>> light on a).
>>
>> Before FLINK-9158 we set the default delay to 10s because Flink did not
>> support queued scheduling, which meant that if one slot was missing or
>> still occupied, Flink would fail right away with a
>> NoResourceAvailableException. We added the delay to prevent this. It
>> also covered the case where the job was failing because of an
>> overloaded external system.
>>
>> When we finished FLIP-6, we thought that we could improve the user
>> experience by decreasing the default delay to 0s, because all
>> Flink-related problems (slot still occupied, slot missing because of a
>> reconnecting TM) could be handled by the default slot request timeout,
>> which allowed the slots to become ready after the scheduling was kicked
>> off. However, we did not properly take the case of overloaded external
>> systems into account.
>>
>> For b) I agree that any default value should be properly documented.
>> This was clearly an oversight when FLINK-9158 was merged. Moreover, I
>> believe that there won't be a solve-it-all default value. There are
>> always cases where one needs to adapt it to one's needs. But this is
>> ok. The goal should be to find the default value which works for most
>> cases.
>>
>> So maybe the middle ground between 10s and 0s could be a solution.
>> Setting the default restart delay to 1s should prevent restart storms
>> caused by overloaded external systems and still be fast enough to not
>> slow down recoveries noticeably in most cases. If one needs a super
>> fast recovery, then one should set the delay value to 0s. If one
>> requires a longer delay because of a particular infrastructure, then
>> one needs to change the value too. What do you think?
>>
>> Cheers,
>> Till
>>
>> On Sun, Sep 1, 2019 at 11:56 PM Yu Li <car...@gmail.com> wrote:
>>
>>> -1 on increasing the default delay to non-zero, for the reasons below:
>>>
>>> a) I could see some concerns about setting the delay to zero in the
>>> very original JIRA (FLINK-2993
>>> <https://issues.apache.org/jira/browse/FLINK-2993>), but later on in
>>> FLINK-9158 <https://issues.apache.org/jira/browse/FLINK-9158> we still
>>> decided to make the change, so I'm wondering whether that decision
>>> also came from a customer requirement. If so, how could we judge
>>> whether one requirement overrides the other?
>>>
>>> b) There could be valid reasons for both default values depending on
>>> the use case, as well as corresponding workarounds (for example, under
>>> the latest policy, setting the config manually to 10s could resolve
>>> the problem mentioned), and from former replies to this thread we can
>>> see that users have already taken action. Changing it back to non-zero
>>> again won't affect such users but might surprise those depending on 0
>>> as the default.
>>>
>>> Last but not least, no matter what decision we make this time, I'd
>>> suggest making it final and documenting it explicitly in our release
>>> notes. Checking the 1.5.0 release notes [1] [2], it seems we didn't
>>> mention the change to the default restart delay, and we'd better learn
>>> from that this time. Thanks.
>>>
>>> [1]
>>> https://flink.apache.org/news/2018/05/25/release-1.5.0.html#release-notes
>>> [2]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/release-notes/flink-1.5.html
>>>
>>> Best Regards,
>>> Yu
>>>
>>> On Sun, 1 Sep 2019 at 04:33, Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> +1 on what Zhu Zhu said.
>>>>
>>>> We also override the default to 10 s.
>>>>
>>>> On Fri, Aug 30, 2019 at 8:58 PM Zhu Zhu <reed...@gmail.com> wrote:
>>>>
>>>>> In our production, we usually override the restart delay to be 10 s.
>>>>> We once encountered cases where external services were overwhelmed
>>>>> by reconnections from frequently restarted tasks.
>>>>> As a safer, though not optimal, option, a default delay larger than
>>>>> 0 s is better in my opinion.
>>>>>
>>>>> On Fri, Aug 30, 2019 at 10:23 PM 未来阳光 <2217232...@qq.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think it's better to increase the default value. +1
>>>>>>
>>>>>> Best.
>>>>>>
>>>>>> ------------------ Original Message ------------------
>>>>>> From: "Till Rohrmann"<trohrm...@apache.org>;
>>>>>> Sent: Friday, August 30, 2019, 10:07 PM
>>>>>> To: "dev"<d...@flink.apache.org>; "user"<user@flink.apache.org>;
>>>>>> Subject: [SURVEY] Is the default restart delay of 0s causing problems?
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I wanted to reach out to you and ask whether decreasing the default
>>>>>> delay to `0 s` for the fixed delay restart strategy [1] is causing
>>>>>> trouble. A user reported that he would like to increase the default
>>>>>> value because it can cause restart storms in case of systematic
>>>>>> faults [2].
>>>>>>
>>>>>> The downside of increasing the default delay would be a slightly
>>>>>> increased restart time if this config option is not explicitly set.
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-9158
>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-11218
>>>>>>
>>>>>> Cheers,
>>>>>> Till
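
To put rough numbers on the restart-storm concern discussed in this thread, here is a minimal Python sketch. It is not Flink code: the function name `failed_restart_attempts`, the 30 s outage, and the 100 ms cost of one failed attempt are all hypothetical values chosen for illustration. It counts how many restart attempts would hit an unavailable external system under a fixed restart delay of 0 s, 1 s, and 10 s.

```python
def failed_restart_attempts(outage_ms: int, delay_ms: int,
                            attempt_cost_ms: int = 100) -> int:
    """Count restart attempts that start while the external system is down.

    Simplified model (an assumption, not Flink's actual scheduler):
    the first attempt starts at t = 0, every attempt takes
    `attempt_cost_ms` to fail, and the job then waits `delay_ms`
    before the next attempt.
    """
    cycle_ms = attempt_cost_ms + delay_ms
    if cycle_ms <= 0:
        raise ValueError("attempt cost plus delay must be positive")
    # Attempts start at 0, cycle, 2*cycle, ...; count those starting
    # before the outage ends, i.e. ceil(outage_ms / cycle_ms).
    return -(-outage_ms // cycle_ms)


if __name__ == "__main__":
    outage_ms = 30_000  # hypothetical 30 s outage of the external system
    for delay_ms in (0, 1_000, 10_000):
        attempts = failed_restart_attempts(outage_ms, delay_ms)
        print(f"delay={delay_ms / 1000:g}s -> {attempts} failed attempts")
```

Under these assumed figures, a 0 s delay yields 300 failed attempts during the outage, a 1 s delay yields 28, and a 10 s delay yields 3. The actual ratio depends on how long a real restart takes, so treat this as an order-of-magnitude illustration only.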