Re: suggestion of FLINK-10868

2019-09-12 Thread Peter Huang
Hi Anyang and Till, I think we agreed on making the interval configurable in this case. Let me revise the current PR. You can review it after that. Best Regards Peter Huang On Thu, Sep 12, 2019 at 12:53 AM Anyang Hu wrote: > Thanks Till, I will continue to follow this issue and see what we c

Re: suggestion of FLINK-10868

2019-09-12 Thread Anyang Hu
Thanks Till, I will continue to follow this issue and see what we can do. Best regards, Anyang Till Rohrmann wrote on Wed, Sep 11, 2019 at 5:12 PM: > Suggestion 1 makes sense. For the quick termination I think we need to > think a bit more about it to find a good solution also to support strict > SLA requirem

Re: suggestion of FLINK-10868

2019-09-11 Thread Till Rohrmann
Suggestion 1 makes sense. For the quick termination I think we need to think a bit more about it to find a good solution also to support strict SLA requirements. Cheers, Till On Wed, Sep 11, 2019 at 11:11 AM Anyang Hu wrote: > Hi Till, > > Some of our online batch tasks have strict SLA requirem

Re: suggestion of FLINK-10868

2019-09-11 Thread Anyang Hu
Hi Till, Some of our online batch tasks have strict SLA requirements, and they are not allowed to be stuck for a long time. Therefore, we took a blunt approach and made the job exit immediately. Waiting for the connection to recover is a better solution. Maybe we need to add a timeout to wait for JM to

Re: suggestion of FLINK-10868

2019-09-09 Thread Till Rohrmann
Hi Anyang, I think we cannot take your proposal because it means that whenever we want to call notifyAllocationFailure while there is a connection problem between the RM and the JM, we fail the whole cluster. This is something a robust and resilient system should not do because connection pr

Re: suggestion of FLINK-10868

2019-09-08 Thread Anyang Hu
Hi Peter, For our online batch task, there is a scenario where the number of failed Containers reaches MAXIMUM_WORKERS_FAILURE_RATE but the client does not exit immediately (the probability of JM loss is greatly increased when thousands of Containers are to be started). It is found that the JM disconnection (the

Re: suggestion of FLINK-10868

2019-09-08 Thread Peter Huang
Hi Till, 1) From Anyang's request, I think it is reasonable to use two parameters for the rate as a batch job runs for a while. The failure rate in a small interval is meaningless. I think they need a failure count from the beginning as the failure condition. @Anyang Hu 2) In the current impleme

Re: suggestion of FLINK-10868

2019-09-08 Thread Anyang Hu
Hi Till, Thank you for the reply. 1. The batch processing may be customized according to the usage scenario. For our online batch jobs, we set the interval parameter to 8h. 2. For our usage scenario, we need the client to exit immediately when the number of failed Containers reaches MAXIMUM_WORKERS_FAILURE_R

Re: suggestion of FLINK-10868

2019-09-06 Thread Till Rohrmann
Hi Anyang, thanks for your suggestions. 1) I guess one needs to make this interval configurable. A session cluster could theoretically execute batch as well as streaming tasks and, hence, I doubt that there is an optimal value. Maybe the default could be a bit longer than 1 min, though. 2) Which
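The configurable interval discussed in this thread can be illustrated with a minimal sketch of a sliding-window failure check. This is not the FLINK-10868 implementation: the class FailureRateTracker and its parameters max_failures and interval_seconds are hypothetical names chosen for illustration, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from collections import deque


class FailureRateTracker:
    """Hypothetical helper: counts failures inside a sliding time window.

    Illustration only, not Flink code. `max_failures` and
    `interval_seconds` stand in for a configurable maximum failure
    rate and its measurement interval.
    """

    def __init__(self, max_failures, interval_seconds):
        self.max_failures = max_failures
        self.interval_seconds = interval_seconds
        self._timestamps = deque()  # failure times, oldest first

    def record_failure(self, now):
        """Remember that a worker failed at time `now` (in seconds)."""
        self._timestamps.append(now)

    def exceeded(self, now):
        """Return True if more than `max_failures` lie within the window."""
        cutoff = now - self.interval_seconds
        # Drop failures that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_failures


# Short window (1 min): old failures are forgotten quickly.
short = FailureRateTracker(max_failures=2, interval_seconds=60)
# Long window (8 h), as mentioned for batch jobs in this thread.
long_window = FailureRateTracker(max_failures=2, interval_seconds=8 * 3600)
for t in (0, 3600, 7200):  # one failure per hour
    short.record_failure(t)
    long_window.record_failure(t)

short_result = short.exceeded(7200)
long_result = long_window.exceeded(7200)
```

With one failure per hour, the 1-minute window only ever sees a single failure, while the 8-hour window sees all three. This is why a single default interval is unlikely to suit both streaming and long-running batch workloads.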

Re: suggestion of FLINK-10868

2019-09-06 Thread Anyang Hu
Thank you for the reply; I look forward to Till's advice. Anyang Peter Huang wrote on Thu, Sep 5, 2019 at 11:53 PM: > Hi Anyang, > > Thanks for raising it up. I think it is reasonable as what you requested > is needed for batch. Let's wait for Till to give some more input. > > > > Best Regards > Peter H

Re: suggestion of FLINK-10868

2019-09-05 Thread Peter Huang
Hi Anyang, Thanks for raising it up. I think it is reasonable as what you requested is needed for batch. Let's wait for Till to give some more input. Best Regards Peter Huang On Thu, Sep 5, 2019 at 7:02 AM Anyang Hu wrote: > Hi Peter&Till: > > As commented in the issue >

suggestion of FLINK-10868

2019-09-05 Thread Anyang Hu
Hi Peter&Till: As commented in the issue, we have deployed the FLINK-10868 patch online (mainly for batch tasks). What do you think of the following two suggestions: 1) Parameter control time int