In addition, there’s another viable strategy that could be applied
either with or without the proposed one.

We could group the exceptions that occur within an interval by exception
class, so that each distinct exception class is counted as only one
failure per interval.

The upside is that it’s more fine-grained and wouldn’t add unnecessary
retry time when the job fails due to different causes.
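A minimal sketch of this idea in Java (the class and method names are illustrative only, not Flink API): at most one failure is counted per exception class within a sliding interval, so a burst of identical errors doesn’t exhaust the failure budget.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch, not Flink API: counts at most one failure per
// exception class within a sliding interval.
public class PerExceptionClassFailureCounter {
    private final long intervalMillis;
    private final int maxFailuresPerInterval;
    // When each exception class was last counted as a failure.
    private final Map<Class<? extends Throwable>, Long> lastCounted = new HashMap<>();

    public PerExceptionClassFailureCounter(Duration interval, int maxFailuresPerInterval) {
        this.intervalMillis = interval.toMillis();
        this.maxFailuresPerInterval = maxFailuresPerInterval;
    }

    // Returns true if the job may restart, false if it should fail.
    public boolean canRestart(Throwable cause, long nowMillis) {
        // Evict exception classes whose last occurrence fell out of the interval.
        lastCounted.values().removeIf(t -> nowMillis - t > intervalMillis);
        // An exception class already seen in this interval adds no new failure.
        lastCounted.putIfAbsent(cause.getClass(), nowMillis);
        return lastCounted.size() <= maxFailuresPerInterval;
    }
}
```

With a budget of 2 failures, repeated exceptions of the same class count once, while a third distinct class within the interval fails the job.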

Best,
Paul Lam

> On Nov 4, 2022, at 17:33, Paul Lam <paullin3...@gmail.com> wrote:
> 
> Hi Weihua,
> 
> +1 for the new restart strategy you suggested.
> 
> We’re also using failure-rate strategy as the cluster-wide default and 
> faced the same problem, which we solved with a similar approach.
> 
> FYI, we added a freeze-period config option to the failure-rate strategy.
> The freeze period prevents counting further errors after the first
> failure happens, so that a burst of errors would not exhaust the
> number of allowed failures.
> 
> Best,
> Paul Lam
> 
>> On Nov 4, 2022, at 16:45, Weihua Hu <huweihua....@gmail.com> wrote:
>> 
>> Hi, everyone
>> 
>> I'd like to bring up a discussion about restart strategies. Flink supports 3
>> kinds of restart strategies. These work very well for jobs with specific
>> configs, but for platform users who manage hundreds of jobs, there is no
>> common strategy that fits all of them.
>> 
>> Let me explain the reason. We manage a lot of jobs, some are
>> keyby-connected with one region per job, some are rescale-connected with
>> many regions per job, and when using the failure rate restart strategy, we
>> cannot achieve the same control with the same configuration.
>> 
>> For example, if I want the job to fail when there are 3 exceptions within 5
>> minutes, the config would look like this:
>> 
>>> restart-strategy.failure-rate.max-failures-per-interval: 3
>>> 
>>> restart-strategy.failure-rate.failure-rate-interval: 5 min
>>> 
>> For the keyby-connected job, this config works well.
>> 
>> However, for the rescale-connected job, we need to consider the number of
>> regions and the number of slots per TaskManager. If each TM has 3 slots,
>> and these 3 slots run the task of 3 regions, then when one TaskManager
>> crashes, it will trigger 3 regions to fail, and the job will fail because
>> it exceeds the threshold of the restart strategy. To tolerate a single TM
>> crash, I must increase max-failures-per-interval to 9, but after that
>> change, user task exceptions will be tolerated far more than I want.
>> 
>> 
>> Therefore, I want to introduce a new restart strategy based on time
>> periods. A continuous period of time (e.g., 5 minutes) is divided into
>> segments of a specific length (e.g., 1 minute). If an exception occurs
>> within a segment (no matter how many times), it is marked as a failed
>> segment. Similar to the failure-rate restart strategy, the job will fail
>> when there are 'm' failed segments within an interval of 'n' segments.
>> 
>> In this mode, the keyby-connected and rescale-connected jobs can use
>> unified configurations.
>> 
>> This is a user-facing change, so if you think it is worth doing, maybe
>> I can create a FLIP to describe it in detail.
>> 
>> Best,
>> Weihua
> 
