Hi @Paul Lam, thanks for the reply. I think it makes a lot of sense to
distinguish between exceptions, but it might add complexity to
restart-policy maintenance, and some exceptions might be wrapped in a
FlinkRuntimeException or another wrapper type.
Maybe we can implement the first version based on time segments.
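To make the time-segment idea concrete, here is a rough sketch of the counting logic (illustrative Python only, not Flink code; all names are made up):

```python
# Illustrative sketch (not Flink code): count failed time segments
# instead of individual exceptions. All class/parameter names are invented.
import time

class SegmentFailureRater:
    def __init__(self, segment_len_s, max_failed_segments, interval_segments):
        self.segment_len_s = segment_len_s      # e.g. 60s per segment
        self.max_failed = max_failed_segments   # 'm' failed segments allowed
        self.interval = interval_segments       # look-back window of 'n' segments
        self.failed_segments = set()            # indices of segments with >=1 failure

    def record_failure(self, now=None):
        now = time.time() if now is None else now
        seg = int(now // self.segment_len_s)
        # Any number of failures inside one segment marks it failed just once.
        self.failed_segments.add(seg)
        # Drop segments that fell out of the look-back window.
        self.failed_segments = {s for s in self.failed_segments
                                if s > seg - self.interval}

    def should_fail_job(self):
        return len(self.failed_segments) >= self.max_failed

# A burst of 10 exceptions within the same minute counts as one failed segment:
rater = SegmentFailureRater(segment_len_s=60, max_failed_segments=3,
                            interval_segments=5)
for _ in range(10):
    rater.record_failure(now=1000.0)   # all fall into the same segment
print(rater.should_fail_job())         # one failed segment -> False
```

This is what makes a single TaskManager crash count as one failure regardless of how many regions it takes down: all the resulting exceptions land in the same segment.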


Can someone grant me edit permissions so that I can create a new FLIP?

Best,
Weihua


On Fri, Nov 4, 2022 at 7:32 PM Paul Lam <paullin3...@gmail.com> wrote:

> In addition, there’s another viable alternative strategy that could be
> applied with or without the proposed strategy.
>
> We could group the exceptions that occur in an interval by exception
> class, so that each distinct exception class within an interval counts
> as one failure.
>
> The upside is that it’s more fine-grained and doesn’t add unnecessary
> retry time when the job fails due to different causes.
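The grouping idea above could be sketched roughly like this (illustrative Python only, not Flink code; the exception names are just examples):

```python
# Illustrative sketch (not Flink code): within one counting interval,
# several failures with the same exception class count as a single failure.
def distinct_failures(failures):
    """failures: list of (timestamp, exception_class_name) in one interval."""
    return len({cls for _, cls in failures})

interval = [
    (0.0,  "java.net.ConnectException"),      # TM lost, hits 3 regions at once
    (0.1,  "java.net.ConnectException"),
    (0.2,  "java.net.ConnectException"),
    (30.0, "java.lang.ArithmeticException"),  # unrelated user-code error
]
print(distinct_failures(interval))  # 2 distinct causes, not 4 raw failures
```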
>
> Best,
> Paul Lam
>
> > On Nov 4, 2022, at 17:33, Paul Lam <paullin3...@gmail.com> wrote:
> >
> > Hi Weihua,
> >
> > +1 for the new restart strategy you suggested.
> >
> > We’re also using failure-rate strategy as the cluster-wide default and
> > faced the same problem, which we solved with a similar approach.
> >
> > FYI, we added a freeze-period config option to the failure-rate
> > strategy. The freeze period prevents counting further errors after the
> > first failure happens, so that a burst of errors does not exhaust the
> > number of allowed failures.
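A freeze period of this kind could be sketched roughly like this (illustrative Python only, not the actual implementation; names are made up):

```python
# Illustrative sketch (not Flink code) of a freeze period: after a failure
# is counted, further failures within freeze_s seconds are ignored, so a
# burst of errors cannot exhaust the failure budget.
class FreezingCounter:
    def __init__(self, freeze_s):
        self.freeze_s = freeze_s
        self.counted = []          # timestamps of counted failures

    def record(self, now):
        if self.counted and now - self.counted[-1] < self.freeze_s:
            return False           # inside the freeze period: not counted
        self.counted.append(now)
        return True

c = FreezingCounter(freeze_s=60.0)
hits = [c.record(t) for t in (0.0, 0.5, 1.0, 61.0)]
print(hits)  # [True, False, False, True] -> the burst is counted once
```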
> >
> > Best,
> > Paul Lam
> >
> >> On Nov 4, 2022, at 16:45, Weihua Hu <huweihua....@gmail.com> wrote:
> >>
> >> Hi, everyone
> >>
> >> I'd like to bring up a discussion about restart strategies. Flink
> >> supports 3 kinds of restart strategies. They work very well for jobs
> >> with specific configs, but for platform users who manage hundreds of
> >> jobs, there is no common strategy to use.
> >>
> >> Let me explain the reason. We manage a lot of jobs: some are
> >> keyby-connected with one region per job, and some are rescale-connected
> >> with many regions per job. When using the failure-rate restart
> >> strategy, we cannot achieve the same control over both with the same
> >> configuration.
> >>
> >> For example, if I want the job to fail when there are 3 exceptions
> >> within 5 minutes, the config would look like this:
> >>
> >>> restart-strategy.failure-rate.max-failures-per-interval: 3
> >>>
> >>> restart-strategy.failure-rate.failure-rate-interval: 5 min
> >>>
> >> For the keyby-connected job, this config works well.
> >>
> >> However, for the rescale-connected job, we need to consider the number
> >> of regions and the number of slots per TaskManager. If each TM has 3
> >> slots, and these 3 slots run tasks from 3 different regions, then when
> >> one TaskManager crashes, it triggers 3 region failures, and the job
> >> fails because it exceeds the threshold of the restart strategy. To
> >> avoid the effect of a single TM crash, I must increase
> >> max-failures-per-interval to 9, but after that change, user task
> >> exceptions will be tolerated more than I want.
> >>
> >>
> >> Therefore, I want to introduce a new restart strategy based on time
> >> segments. A continuous period of time (e.g., 5 minutes) is divided
> >> into segments of a specific length (e.g., 1 minute). If any exception
> >> occurs within a segment (no matter how many times), the segment is
> >> marked as failed. Similar to the failure-rate restart strategy, the
> >> job fails when there are 'm' failed segments within an interval of 'n'
> >> segments.
> >>
> >> In this mode, the keyby-connected and rescale-connected jobs can share
> >> a unified configuration.
> >>
> >> This is a user-facing change, so if you think it is worth doing, maybe
> >> I can create a FLIP to describe it in detail.
> >> Best,
> >> Weihua
> >
>
>
