Re: [DISCUSS]Introduce a time-segment based restart strategy

2022-12-01 Thread Dong Lin
Hi Gen, Thanks for the suggestions! Regarding how to implement the per-region RestartBackoffTimeStrategy as proposed previously, I think your approach works well. Here are more details: - Keep the RestartBackoffTimeStrategy interface API unchanged and only change its semantics, such that all str

Re: [DISCUSS]Introduce a time-segment based restart strategy

2022-11-30 Thread weijie guo
Hi all, Thank you very much for this interesting discussion. TBH, Dong's proposal made me very excited. Our users won't need to be tortured by choosing the right one among many strategies. However, as Gen said, it may require some changes to the RestartBackoffTimeStrategy interface as it d

Re: [DISCUSS]Introduce a time-segment based restart strategy

2022-11-25 Thread Gen Luo
Hi all, Sorry for jumping in late. To meet Weihua's need, Dong's proposal seems pretty fine, but the modification it requires, I'm afraid, is not really easy. RestartBackoffTimeStrategy is quite a simple interface. The strategy doesn't even know which task is failing, not to mention the divis

Re: [DISCUSS]Introduce a time-segment based restart strategy

2022-11-21 Thread Paul Lam
Dong’s proposal LGTM. Best, Paul Lam > On Nov 19, 2022, at 10:50, Dong Lin wrote: > > Hey Weihua, > > Thanks for proposing the new strategy! > > If I understand correctly, the main issue is that different failover > regions can be restarted independently, but they share the same counter > when counting t

Re: [DISCUSS]Introduce a time-segment based restart strategy

2022-11-18 Thread Dong Lin
Hey Weihua, Thanks for proposing the new strategy! If I understand correctly, the main issue is that different failover regions can be restarted independently, but they share the same counter when counting the number of failures in an interval. So the number of failures for a given region is less

Re: [DISCUSS]Introduce a time-segment based restart strategy

2022-11-08 Thread Weihua Hu
Hi @Paul Lam, Thanks for the reply. I think it makes a lot of sense to distinguish exceptions, but it might add complexity to the restart policy maintenance, and some exceptions might be wrapped in the FlinkRuntimeException or something else. Maybe we can implement the first version based on the ti

Re: [DISCUSS]Introduce a time-segment based restart strategy

2022-11-04 Thread Paul Lam
In addition, there’s another viable alternative strategy that could be applied with or without the proposed strategy. We could group the exceptions that occurred in an interval by exception class. Only a distinct exception within an interval is counted as one failure. The upside is that it’s more fi

Re: [DISCUSS]Introduce a time-segment based restart strategy

2022-11-04 Thread Paul Lam
Hi Weihua, +1 for the new restart strategy you suggested. We’re also using the failure-rate strategy as the cluster-wide default and faced the same problem, which we solved with a similar approach. FYI, we added a freeze period config option to the failure-rate strategy. The freeze period would preven

[DISCUSS]Introduce a time-segment based restart strategy

2022-11-04 Thread Weihua Hu
Hi everyone, I'd like to start a discussion about restart strategies. Flink supports 3 kinds of restart strategy. These work very well for jobs with specific configs, but for platform users who manage hundreds of jobs, there is no common strategy to use. Let me explain the reason. We manage a lo