Hi Weihua, +1 for the new restart strategy you suggested.
We’re also using failure-rate strategy as the cluster-wide default and faced the same problem, which we solved with a similar approach. FYI. We added a freeze period config option to failure-rate strategy. The freeze period would prevent counting further errors after the first failure happens, so that a burst errors would not exhaust the number of allow failures. Best, Paul Lam > 2022年11月4日 16:45,Weihua Hu <huweihua....@gmail.com> 写道: > > Hi, everyone > > I'd like to bring up a discussion about restart strategy. Flink supports 3 > kinds of restart strategy. These work very well for jobs with specific > configs, but for platform users who manage hundreds of jobs, there is no > common strategy to use. > > Let me explain the reason. We manage a lot of jobs, some are > keyby-connected with one region per job, some are rescale-connected with > many regions per job, and when using the failure rate restart strategy, we > cannot achieve the same control with the same configuration. > > For example, if I want the job to fail when there are 3 exceptions within 5 > minutes, the config would look like this: > >> restart-strategy.failure-rate.max-failures-per-interval: 3 >> >> restart-strategy.failure-rate.failure-rate-interval: 5 min >> > For the keyby-connected job, this config works well. > > However, for the rescale-connected job, we need to consider the number of > regions and the number of slots per TaskManager. If each TM has 3 slots, > and these 3 slots run the task of 3 regions, then when one TaskManager > crashes, it will trigger 3 regions to fail, and the job will fail because > it exceeds the threshold of the restart strategy. To avoid the effect of > single TM crashes, I must increase the max-failures-per-interval to 9, but > after the change, user task exceptions will be more tolerant than I want. > > > Therefore, I want to introduce a new restart strategy based on time > periods. A continuous period of time (e.g., 5 minutes) is divided into > segments of a specific length (e.g., 1 minute). If an exception occurs > within a segment (no matter how many times), it is marked as a failed > segment. Similar to failure-rate restart strategy, the job will fail when > there are 'm' failed segments in the interval of 'n' . > > In this mode, the keyby-connected and rescale-connected jobs can use > unified configurations. > > This is a user-relevant change, so if you think this is worth to do, maybe > I can create a FLIP to describe it in detail. > Best, > Weihua