Hi Weihua,

+1 for the new restart strategy you suggested.

We’re also using the failure-rate strategy as the cluster-wide default and 
faced the same problem, which we solved with a similar approach.

FYI, we added a freeze-period config option to the failure-rate strategy. 
Once the first failure happens, further errors within the freeze period
are not counted, so that a burst of errors does not exhaust the 
number of allowed failures. 
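
To make the idea concrete, here is a minimal sketch of the counting logic.
It is not our actual patch and not Flink's own strategy class. The class
name, method name, and parameters are hypothetical, and it assumes the job
fails once the counted failures inside the interval reach the maximum.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a failure-rate counter with a freeze period. */
final class FreezingFailureRateCounter {
    private final int maxFailuresPerInterval;
    private final long intervalMs;
    private final long freezePeriodMs;
    // Timestamps of failures that were actually counted, oldest first.
    private final Deque<Long> countedFailures = new ArrayDeque<>();
    // Timestamp of the last counted failure; null until one is counted.
    private Long lastCountedMs = null;

    FreezingFailureRateCounter(int maxFailuresPerInterval, long intervalMs, long freezePeriodMs) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMs = intervalMs;
        this.freezePeriodMs = freezePeriodMs;
    }

    /** Records a failure at nowMs and returns true if the job may still restart. */
    boolean notifyFailure(long nowMs) {
        // Failures that arrive within the freeze period after a counted failure are ignored.
        boolean frozen = lastCountedMs != null && nowMs - lastCountedMs < freezePeriodMs;
        if (!frozen) {
            countedFailures.addLast(nowMs);
            lastCountedMs = nowMs;
        }
        // Drop counted failures that have left the sliding interval.
        while (!countedFailures.isEmpty() && nowMs - countedFailures.peekFirst() > intervalMs) {
            countedFailures.removeFirst();
        }
        // Assumption: reaching the maximum within the interval fails the job.
        return countedFailures.size() < maxFailuresPerInterval;
    }
}
```

With a maximum of 3 failures per 5 minutes and a 10-second freeze period,
three region failures caused by one TaskManager crash count as a single
failure in this sketch.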

Best,
Paul Lam

> On Nov 4, 2022, at 16:45, Weihua Hu <huweihua....@gmail.com> wrote:
> 
> Hi, everyone
> 
> I'd like to bring up a discussion about restart strategies. Flink supports 3
> kinds of restart strategies. These work very well for jobs with specific
> configs, but for platform users who manage hundreds of jobs, there is no
> common strategy to use.
> 
> Let me explain the reason. We manage a lot of jobs: some are
> keyby-connected with one region per job, and some are rescale-connected
> with many regions per job. With the failure-rate restart strategy, we
> cannot achieve the same control with the same configuration.
> 
> For example, if I want the job to fail when there are 3 exceptions within 5
> minutes, the config would look like this:
> 
>> restart-strategy.failure-rate.max-failures-per-interval: 3
>> 
>> restart-strategy.failure-rate.failure-rate-interval: 5 min
>> 
> For the keyby-connected job, this config works well.
> 
> However, for the rescale-connected job, we need to consider the number of
> regions and the number of slots per TaskManager. If each TM has 3 slots,
> and these 3 slots run tasks from 3 different regions, then a single
> TaskManager crash causes 3 regions to fail, and the job fails because
> it reaches the threshold of the restart strategy. To avoid the effect of
> single TM crashes, I must increase max-failures-per-interval to 9, but
> after that change the strategy tolerates more user task exceptions than I want.
> 
> 
> Therefore, I want to introduce a new restart strategy based on time
> periods. A continuous period of time (e.g., 5 minutes) is divided into
> segments of a specific length (e.g., 1 minute). If an exception occurs
> within a segment (no matter how many times), it is marked as a failed
> segment. Similar to the failure-rate restart strategy, the job will fail when
> there are 'm' failed segments in an interval of 'n'.
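
A rough sketch of how the segment counting could work, as I read the
proposal. This is only an illustration, not a proposed implementation. The
class and parameter names are made up, segments are assumed to be aligned to
fixed boundaries (timestamp divided by segment length), and 'n' is taken as
a number of segments.

```java
import java.util.TreeSet;

/** Hypothetical sketch of the segment-based failure tracking described above. */
final class FailedSegmentTracker {
    private final long segmentMs;        // length of one segment, e.g. 1 minute
    private final int windowSegments;    // 'n': number of segments in the interval
    private final int maxFailedSegments; // 'm': failed segments that fail the job
    // Indexes of segments that saw at least one failure.
    private final TreeSet<Long> failedSegments = new TreeSet<>();

    FailedSegmentTracker(long segmentMs, int windowSegments, int maxFailedSegments) {
        this.segmentMs = segmentMs;
        this.windowSegments = windowSegments;
        this.maxFailedSegments = maxFailedSegments;
    }

    /** Records a failure at nowMs (non-negative) and returns true if the job may still restart. */
    boolean notifyFailure(long nowMs) {
        long segment = nowMs / segmentMs;
        // A set marks the segment once, no matter how many failures fall into it.
        failedSegments.add(segment);
        // Keep only the last windowSegments segments, ending at the current one.
        failedSegments.headSet(segment - windowSegments + 1).clear();
        return failedSegments.size() < maxFailedSegments;
    }
}
```

With 1-minute segments, n = 5 and m = 3, a TaskManager crash that fails 3
regions at nearly the same time marks only one segment as failed, so the
same settings behave alike for keyby-connected and rescale-connected jobs.
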
> 
> In this mode, the keyby-connected and rescale-connected jobs can use
> unified configurations.
> 
> This is a user-relevant change, so if you think this is worth doing, maybe
> I can create a FLIP to describe it in detail.
> Best,
> Weihua
