Hi, everyone

I'd like to bring up a discussion about restart strategy. Flink supports 3
kinds of restart strategy. These work very well for jobs with specific
configs, but for platform users who manage hundreds of jobs, there is no
common strategy to use.

Let me explain the reason. We manage a lot of jobs, some are
keyby-connected with one region per job, some are rescale-connected with
many regions per job, and when using the failure rate restart strategy, we
cannot achieve the same control with the same configuration.

For example, if I want the job to fail when there are 3 exceptions within 5
minutes, the config would look like this:

> restart-strategy.failure-rate.max-failures-per-interval: 3
>
> restart-strategy.failure-rate.failure-rate-interval: 5 min
>
For the keyby-connected job, this config works well.

However, for the rescale-connected job, we need to consider the number of
regions and the number of slots per TaskManager. If each TM has 3 slots,
and these 3 slots run the task of 3 regions, then when one TaskManager
crashes, it will trigger 3 regions to fail, and the job will fail because
it exceeds the threshold of the restart strategy. To avoid the effect of
single TM crashes, I must increase the max-failures-per-interval to 9, but
after the change, user task exceptions will be more tolerant than I want.


Therefore, I want to introduce a new restart strategy based on time
periods. A continuous period of time (e.g., 5 minutes) is divided into
segments of a specific length (e.g., 1 minute). If an exception occurs
within a segment (no matter how many times), it is marked as a failed
segment. Similar to failure-rate restart strategy, the job will fail when
there are 'm' failed segments in the interval of 'n' .

In this mode, the keyby-connected and rescale-connected jobs can use
unified configurations.

This is a user-relevant change, so if you think this is worth to do, maybe
I can create a FLIP to describe it in detail.
Best,
Weihua

Reply via email to