Hi Till,

1) Based on Anyang's request, I think it is reasonable to use two parameters for the rate, because a batch job runs for a while. A failure rate measured over a small interval is meaningless for such a job. I think they need a failure count from the beginning of the job as the failure condition.
@Anyang Hu <huanyang1...@gmail.com>

2) In the current implementation, the MaximumFailedTaskManagerExceedingException is a SuppressRestartsException, so it will exit immediately.

Best Regards
Peter Huang

On Sun, Sep 8, 2019 at 1:27 AM Anyang Hu <huanyang1...@gmail.com> wrote:

> Hi Till,
> Thank you for the reply.
>
> 1. The batch processing may be customized according to the usage scenario.
> For our online batch jobs, we set the interval parameter to 8h.
> 2. For our usage scenario, we need the client to exit immediately when the
> number of failed Containers reaches MAXIMUM_WORKERS_FAILURE_RATE.
>
> Best Regards,
> Anyang
>
> Till Rohrmann <trohrm...@apache.org> wrote on Fri, Sep 6, 2019 at 9:33 PM:
>
>> Hi Anyang,
>>
>> thanks for your suggestions.
>>
>> 1) I guess one needs to make this interval configurable. A session
>> cluster could theoretically execute batch as well as streaming tasks and,
>> hence, I doubt that there is an optimal value. Maybe the default could be a
>> bit longer than 1 min, though.
>>
>> 2) Which component do you want to let terminate immediately?
>>
>> I think we can consider your input while reviewing the PR. If it would be
>> a bigger change, then it would be best to create a follow up issue once
>> FLINK-10868 has been merged.
>>
>> Cheers,
>> Till
>>
>> On Fri, Sep 6, 2019 at 11:42 AM Anyang Hu <huanyang1...@gmail.com> wrote:
>>
>>> Thank you for the reply and look forward to the advice of Till.
>>>
>>> Anyang
>>>
>>> Peter Huang <huangzhenqiu0...@gmail.com> wrote on Thu, Sep 5, 2019 at 11:53 PM:
>>>
>>>> Hi Anyang,
>>>>
>>>> Thanks for bringing it up. I think it is reasonable, as what you
>>>> requested is needed for batch. Let's wait for Till to give some more input.
>>>>
>>>> Best Regards
>>>> Peter Huang
>>>>
>>>> On Thu, Sep 5, 2019 at 7:02 AM Anyang Hu <huanyang1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Peter & Till,
>>>>>
>>>>> As commented in the issue
>>>>> <https://issues.apache.org/jira/browse/FLINK-10868#>, we have
>>>>> introduced the FLINK-10868
>>>>> <https://issues.apache.org/jira/browse/FLINK-10868> patch (mainly
>>>>> batch tasks) online. What do you think of the following two suggestions:
>>>>>
>>>>> 1) A parameter to control the time interval. At present, the default time
>>>>> interval of 1 min is used, which is too short for batch tasks;
>>>>>
>>>>> 2) A parameter to control whether to perform OnFatalError when the number
>>>>> of failed Containers reaches MAXIMUM_WORKERS_FAILURE_RATE and the JM
>>>>> disconnects, so that the batch tasks can exit as soon as possible.
>>>>>
>>>>> Best regards,
>>>>> Anyang
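
The two failure conditions discussed in this thread (a failure rate over a configurable interval, and a failure count since the start of the job) could be combined roughly as sketched below. This is only an illustration of the idea, not the FLINK-10868 implementation: the class name WorkerFailureTracker, its constructor parameters, and its methods are all hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch; names and parameters are assumptions, not Flink APIs. */
final class WorkerFailureTracker {
    private final long intervalMillis;        // length of the rate interval (e.g. 1 min or 8 h)
    private final int maxFailuresPerInterval; // limit for failures inside one interval
    private final int maxTotalFailures;       // limit for failures since the job started
    private final Deque<Long> recentFailures = new ArrayDeque<>();
    private int totalFailures = 0;

    WorkerFailureTracker(long intervalMillis, int maxFailuresPerInterval, int maxTotalFailures) {
        this.intervalMillis = intervalMillis;
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.maxTotalFailures = maxTotalFailures;
    }

    /** Records one failure and returns true if either limit is now exceeded. */
    boolean recordFailure(long nowMillis) {
        totalFailures++;
        recentFailures.addLast(nowMillis);
        // Drop failures that are older than the configured interval.
        while (!recentFailures.isEmpty()
                && nowMillis - recentFailures.peekFirst() >= intervalMillis) {
            recentFailures.removeFirst();
        }
        return recentFailures.size() > maxFailuresPerInterval
                || totalFailures > maxTotalFailures;
    }

    int failuresInInterval() {
        return recentFailures.size();
    }

    int totalFailures() {
        return totalFailures;
    }
}
```

With a 1 min interval, failures that are spread two minutes apart never trip the per-interval limit, which is why a long-running batch job would also want the count since the start of the job (or a much longer interval, such as the 8h mentioned above).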