Dong’s proposal LGTM.

Best,
Paul Lam
> On Nov 19, 2022, at 10:50, Dong Lin <lindon...@gmail.com> wrote:
>
> Hey Weihua,
>
> Thanks for proposing the new strategy!
>
> If I understand correctly, the main issue is that different failover
> regions can be restarted independently, but they share the same counter
> when counting the number of failures in an interval. So the number of
> failures each individual region can tolerate is less than what users expect.
>
> Given that regions can be restarted independently, it might be more usable
> and intuitive to count the number of failures for each region when
> executing the failover strategy. Thus, instead of adding a new failover
> strategy, how about we update the existing failure-rate strategy, and
> probably other existing strategies as well, to use the following semantics:
>
> - For any given region in the job, its number of failures within
> failure-rate-interval should not exceed max-failures-per-interval.
> Otherwise, the job will fail without being restarted.
>
> With these updated semantics, the keyby-connected job will behave the same
> as it does in existing Flink under the failure-rate strategy. For the
> rescale-connected job, in the case you described above, after the TM fails,
> each of the 3 regions will increment its failure count from 0 to 1, which
> is still less than max-failures-per-interval, so the rescale-connected job
> can continue to run.
>
> This alternative approach solves the problem without adding to the
> complexity of the failover strategy choice, and it does not require us to
> check whether two exceptions share the same root cause. Do you think it
> can work?
>
> Thanks,
> Dong
>
>
> On Fri, Nov 4, 2022 at 4:46 PM Weihua Hu <huweihua....@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I'd like to bring up a discussion about restart strategies. Flink supports
>> 3 kinds of restart strategies. They work very well for jobs with
>> job-specific configs, but for platform users who manage hundreds of jobs,
>> there is no single configuration that works for all of them.
>>
>> Let me explain the reason. We manage a lot of jobs: some are
>> keyby-connected, with one region per job, and some are rescale-connected,
>> with many regions per job. When using the failure-rate restart strategy,
>> we cannot achieve the same level of control with the same configuration.
>>
>> For example, if I want a job to fail when there are 3 exceptions within
>> 5 minutes, the config would look like this:
>>
>>> restart-strategy.failure-rate.max-failures-per-interval: 3
>>>
>>> restart-strategy.failure-rate.failure-rate-interval: 5 min
>>>
>> For the keyby-connected job, this config works well.
>>
>> However, for the rescale-connected job, we need to consider the number of
>> regions and the number of slots per TaskManager. If each TM has 3 slots,
>> and these 3 slots run tasks of 3 different regions, then one TaskManager
>> crash triggers 3 region failures, and the job fails because the threshold
>> of the restart strategy is exceeded. To keep a single TM crash from
>> failing the job, I must increase max-failures-per-interval to 9, but after
>> that change the job tolerates many more user task exceptions than I want.
>>
>> Therefore, I want to introduce a new restart strategy based on time
>> periods. A continuous period of time (e.g., 5 minutes) is divided into
>> segments of a specific length (e.g., 1 minute). If an exception occurs
>> within a segment, no matter how many times, that segment is marked as a
>> failed segment. Similar to the failure-rate restart strategy, the job
>> fails when there are 'm' failed segments within the interval 'n'.
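To make the proposed semantics concrete, here is a minimal, self-contained
sketch of the segment-based counting described above. This is not Flink code:
the class and method names are hypothetical, and a real implementation would
plug into Flink's restart strategy interfaces rather than stand alone.

    import java.time.Duration;
    import java.util.ArrayDeque;
    import java.util.Deque;

    /**
     * Hypothetical sketch of the proposed segment-based restart strategy.
     * Time is divided into fixed-length segments; any number of failures
     * inside one segment marks that single segment as failed. Restarting
     * is refused once 'maxFailedSegments' segments have failed within the
     * sliding 'failureInterval'.
     */
    public class SegmentFailureRateTracker {

        private final long segmentLengthMs;   // e.g. 1 minute
        private final long failureIntervalMs; // e.g. 5 minutes
        private final int maxFailedSegments;  // e.g. 3

        // Start timestamps of segments that contained at least one failure.
        private final Deque<Long> failedSegmentStarts = new ArrayDeque<>();

        public SegmentFailureRateTracker(
                Duration segmentLength, Duration failureInterval, int maxFailedSegments) {
            this.segmentLengthMs = segmentLength.toMillis();
            this.failureIntervalMs = failureInterval.toMillis();
            this.maxFailedSegments = maxFailedSegments;
        }

        /** Records a failure; returns true if the job may restart, false if it should fail. */
        public synchronized boolean canRestart(long nowMs) {
            long segmentStart = nowMs - (nowMs % segmentLengthMs);

            // Multiple failures in the same segment are counted only once.
            if (failedSegmentStarts.isEmpty() || failedSegmentStarts.peekLast() != segmentStart) {
                failedSegmentStarts.addLast(segmentStart);
            }

            // Evict segments that have slid out of the interval.
            while (!failedSegmentStarts.isEmpty()
                    && failedSegmentStarts.peekFirst() <= nowMs - failureIntervalMs) {
                failedSegmentStarts.removeFirst();
            }

            return failedSegmentStarts.size() < maxFailedSegments;
        }
    }

With segmentLength = 1 min, failureInterval = 5 min, and maxFailedSegments = 3,
a TM crash that fails 3 regions within the same minute marks only one segment
as failed, while 3 exceptions spread across 3 different minutes within the
5-minute interval still fail the job.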
>> In this mode, keyby-connected and rescale-connected jobs can share a
>> unified configuration.
>>
>> This is a user-facing change, so if you think it is worth doing, I can
>> create a FLIP to describe it in detail.
>>
>> Best,
>> Weihua
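For comparison, Dong's per-region alternative quoted above could be sketched
along the same lines. Again, the names are hypothetical and this only
illustrates the semantics of his wording ("should not exceed
max-failures-per-interval"), not an actual implementation:

    import java.time.Duration;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    /**
     * Hypothetical sketch of Dong's alternative: one sliding-window failure
     * counter per failover region instead of one counter shared by the whole
     * job. A TM crash that fails k regions adds one failure to each of k
     * separate counters, so no single counter jumps by k.
     */
    public class PerRegionFailureRateTracker {

        private final long failureIntervalMs;
        private final int maxFailuresPerInterval;

        // region id -> timestamps of that region's recent failures
        private final Map<String, Deque<Long>> failuresByRegion = new HashMap<>();

        public PerRegionFailureRateTracker(Duration failureInterval, int maxFailuresPerInterval) {
            this.failureIntervalMs = failureInterval.toMillis();
            this.maxFailuresPerInterval = maxFailuresPerInterval;
        }

        /**
         * Records a failure of the given region; returns true if the job may
         * restart, false if this region's failures exceed the per-region budget.
         */
        public synchronized boolean canRestart(String regionId, long nowMs) {
            Deque<Long> timestamps =
                    failuresByRegion.computeIfAbsent(regionId, id -> new ArrayDeque<>());
            timestamps.addLast(nowMs);

            // Evict failures that have slid out of the interval.
            while (!timestamps.isEmpty()
                    && timestamps.peekFirst() <= nowMs - failureIntervalMs) {
                timestamps.removeFirst();
            }
            return timestamps.size() <= maxFailuresPerInterval;
        }
    }

Under these semantics, the rescale-connected job in Weihua's example survives
a single TM crash even with max-failures-per-interval: 3, because each of the
3 affected regions records only one failure in its own counter.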