Re: suggestion of FLINK-10868

Peter Huang Sun, 08 Sep 2019 01:35:14 -0700

Hi Till,

1) Based on Anyang's request, I think it is reasonable to use two
parameters for the rate, since a batch job runs for a long time. The
failure rate over a small interval is meaningless for such jobs.
I think they need a failure count measured from the start of the job as
the failure condition.
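
To illustrate the two-parameter idea, here is a minimal sketch of a
sliding-window failure counter. It is not the code from the PR: the class
name WorkerFailureRateTracker and the parameters maxFailuresPerInterval and
intervalMillis are hypothetical names chosen for this example. It only
shows why the same three failures trip an 8h window but not a 1 min one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch, not the FLINK-10868 implementation. */
final class WorkerFailureRateTracker {
    private final int maxFailuresPerInterval; // assumed name: allowed failures per window
    private final long intervalMillis;        // assumed name: window length
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    WorkerFailureRateTracker(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /** Records one failure and returns true if the window now holds too many. */
    boolean recordFailure(long nowMillis) {
        failureTimestamps.addLast(nowMillis);
        // Drop failures that have fallen out of the window.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() > intervalMillis) {
            failureTimestamps.removeFirst();
        }
        return failureTimestamps.size() > maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long eightHours = 8 * 60 * minute;
        WorkerFailureRateTracker shortWindow = new WorkerFailureRateTracker(2, minute);
        WorkerFailureRateTracker longWindow = new WorkerFailureRateTracker(2, eightHours);

        boolean shortExceeded = false;
        boolean longExceeded = false;
        // Three container failures, one hour apart, as a long batch job might see.
        for (int i = 0; i < 3; i++) {
            long t = i * 60 * minute;
            shortExceeded |= shortWindow.recordFailure(t);
            longExceeded |= longWindow.recordFailure(t);
        }
        System.out.println("1 min window exceeded: " + shortExceeded); // false
        System.out.println("8 h window exceeded: " + longExceeded);    // true
    }
}
```

With an interval at least as long as the job itself, the window degenerates
into a plain failure count from the beginning of the run.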


@Anyang Hu <huanyang1...@gmail.com>
2) In the current implementation, the
MaximumFailedTaskManagerExceedingException is a SuppressRestartsException,
so it will exit immediately instead of being restarted.
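
As a rough illustration of that mechanism, the sketch below uses stand-in
classes only; none of these names are Flink's real classes. It assumes the
restart logic walks the cause chain of a failure and skips the restart when
it finds the "suppress restarts" marker type.

```java
/** Hypothetical sketch with stand-in classes; not Flink's actual code. */
final class RestartDecisionSketch {

    /** Stand-in for a marker exception that disables restarts. */
    static class SuppressRestartsMarker extends RuntimeException {
        SuppressRestartsMarker(String message) {
            super(message);
        }
    }

    /** Stand-in for the "too many failed workers" exception discussed above. */
    static class TooManyFailedWorkers extends SuppressRestartsMarker {
        TooManyFailedWorkers(String message) {
            super(message);
        }
    }

    /** Returns false if the failure or any of its causes carries the marker type. */
    static boolean shouldRestart(Throwable failure) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if (t instanceof SuppressRestartsMarker) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Throwable ordinary = new RuntimeException("task failed");
        Throwable wrapped =
                new RuntimeException("job failed", new TooManyFailedWorkers("limit reached"));
        System.out.println("restart ordinary failure: " + shouldRestart(ordinary)); // true
        System.out.println("restart wrapped failure: " + shouldRestart(wrapped));   // false
    }
}
```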


Best Regards
Peter Huang




On Sun, Sep 8, 2019 at 1:27 AM Anyang Hu <huanyang1...@gmail.com> wrote:

> Hi Till,
> Thank you for the reply.
>
> 1. The batch processing may be customized according to the usage scenario.
> For our online batch jobs, we set the interval parameter to 8h.
> 2. For our usage scenario, we need the client to exit immediately when the
> number of failed Containers reaches MAXIMUM_WORKERS_FAILURE_RATE.
>
> Best Regards,
> Anyang
>
> On Fri, Sep 6, 2019 at 9:33 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi Anyang,
>>
>> thanks for your suggestions.
>>
>> 1) I guess one needs to make this interval configurable. A session
>> cluster could theoretically execute batch as well as streaming tasks and,
>> hence, I doubt that there is an optimal value. Maybe the default could be a
>> bit longer than 1 min, though.
>>
>> 2) Which component do you want to terminate immediately?
>>
>> I think we can consider your input while reviewing the PR. If it would be
>> a bigger change, then it would be best to create a follow up issue once
>> FLINK-10868 has been merged.
>>
>> Cheers,
>> Till
>>
>> On Fri, Sep 6, 2019 at 11:42 AM Anyang Hu <huanyang1...@gmail.com> wrote:
>>
>>> Thank you for the reply. I look forward to Till's advice.
>>>
>>> Anyang
>>>
>>> On Thu, Sep 5, 2019 at 11:53 PM Peter Huang <huangzhenqiu0...@gmail.com> wrote:
>>>
>>>> Hi Anyang,
>>>>
>>>> Thanks for raising this. I think it is reasonable, as what you
>>>> requested is needed for batch jobs. Let's wait for Till to give some
>>>> more input.
>>>>
>>>>
>>>>
>>>> Best Regards
>>>> Peter Huang
>>>>
>>>> On Thu, Sep 5, 2019 at 7:02 AM Anyang Hu <huanyang1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Peter&Till:
>>>>>
>>>>> As commented in the issue
>>>>> <https://issues.apache.org/jira/browse/FLINK-10868#>, we have
>>>>> deployed the FLINK-10868
>>>>> <https://issues.apache.org/jira/browse/FLINK-10868> patch online
>>>>> (mainly for batch tasks). What do you think of the following two
>>>>> suggestions?
>>>>>
>>>>> 1) Add a parameter to control the time interval. At present, the
>>>>> default time interval of 1 min is used, which is too short for batch
>>>>> tasks;
>>>>>
>>>>> 2) Add a parameter to control whether to perform OnFatalError when the
>>>>> number of failed Containers reaches MAXIMUM_WORKERS_FAILURE_RATE and
>>>>> the JM disconnects, so that batch tasks can exit as soon as possible.
>>>>>
>>>>> Best regards,
>>>>> Anyang
>>>>>
>>>>
