In addition, there’s another viable alternative strategy that could be applied with or without the proposed strategy.
We could group the exceptions occurred in an interval by exception class. Only a distinct exception within an interval is counted as one failure. The upside is that it’s more fine-grained and wouldn’t increase the unnecessary retry time if the job are failed due to different causes. Best, Paul Lam > 2022年11月4日 17:33,Paul Lam <paullin3...@gmail.com> 写道: > > Hi Weihua, > > +1 for the new restart strategy you suggested. > > We’re also using failure-rate strategy as the cluster-wide default and > faced the same problem, which we solved with a similar approach. > > FYI. We added a freeze period config option to failure-rate strategy. > The freeze period would prevent counting further errors after the first > failure happens, so that a burst errors would not exhaust the > number of allow failures. > > Best, > Paul Lam > >> 2022年11月4日 16:45,Weihua Hu <huweihua....@gmail.com >> <mailto:huweihua....@gmail.com>> 写道: >> >> Hi, everyone >> >> I'd like to bring up a discussion about restart strategy. Flink supports 3 >> kinds of restart strategy. These work very well for jobs with specific >> configs, but for platform users who manage hundreds of jobs, there is no >> common strategy to use. >> >> Let me explain the reason. We manage a lot of jobs, some are >> keyby-connected with one region per job, some are rescale-connected with >> many regions per job, and when using the failure rate restart strategy, we >> cannot achieve the same control with the same configuration. >> >> For example, if I want the job to fail when there are 3 exceptions within 5 >> minutes, the config would look like this: >> >>> restart-strategy.failure-rate.max-failures-per-interval: 3 >>> >>> restart-strategy.failure-rate.failure-rate-interval: 5 min >>> >> For the keyby-connected job, this config works well. >> >> However, for the rescale-connected job, we need to consider the number of >> regions and the number of slots per TaskManager. If each TM has 3 slots, >> and these 3 slots run the task of 3 regions, then when one TaskManager >> crashes, it will trigger 3 regions to fail, and the job will fail because >> it exceeds the threshold of the restart strategy. To avoid the effect of >> single TM crashes, I must increase the max-failures-per-interval to 9, but >> after the change, user task exceptions will be more tolerant than I want. >> >> >> Therefore, I want to introduce a new restart strategy based on time >> periods. A continuous period of time (e.g., 5 minutes) is divided into >> segments of a specific length (e.g., 1 minute). If an exception occurs >> within a segment (no matter how many times), it is marked as a failed >> segment. Similar to failure-rate restart strategy, the job will fail when >> there are 'm' failed segments in the interval of 'n' . >> >> In this mode, the keyby-connected and rescale-connected jobs can use >> unified configurations. >> >> This is a user-relevant change, so if you think this is worth to do, maybe >> I can create a FLIP to describe it in detail. >> Best, >> Weihua >