Hi Gen,
Thanks for the suggestions!
Regarding how to implement the per-region RestartBackoffTimeStrategy as
proposed previously, I think your approach works well.
Here are more details:
- Keep the RestartBackoffTimeStrategy interface API unchanged and only
change its semantics, such that all str
Hi, all,
Thank you very much for this interesting discussion.
TBH, Dong's proposal made me very excited. Our users don't need to be
tortured by choosing the right one among many strategies.
However, as Gen said, it may require some changes to the
RestartBackoffTimeStrategy interface as it d
Hi all,
Sorry for jumping in late.
To meet Weihua's need, Dong's proposal seems pretty fine, but the
modification it requires, I'm afraid, is not really easy.
RestartBackoffTimeStrategy is quite a simple interface. The strategy doesn't
even know which task is failing, not to mention the divis
Dong’s proposal LGTM.
Best,
Paul Lam
> On Nov 19, 2022, at 10:50, Dong Lin wrote:
>
> Hey Weihua,
>
> Thanks for proposing the new strategy!
>
> If I understand correctly, the main issue is that different failover
> regions can be restarted independently, but they share the same counter
> when counting t
Hey Weihua,
Thanks for proposing the new strategy!
If I understand correctly, the main issue is that different failover
regions can be restarted independently, but they share the same counter
when counting the number of failures in an interval. So the number of
failures for a given region is less
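
To illustrate the shared-counter issue described above, here is a minimal
sketch of counting failures per region instead of per job. Everything in it is
an assumption for illustration: the class name, the String region id, and the
method signature are hypothetical, and it does not implement Flink's
RestartBackoffTimeStrategy interface. It also assumes timestamps are passed in
non-decreasing order.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not a Flink class): a failure-rate check kept separately per region. */
public class PerRegionFailureRateTracker {
    private final int maxFailuresPerInterval;
    private final long intervalMs;
    // One sliding window of failure timestamps per (hypothetical) region id.
    private final Map<String, Deque<Long>> failuresByRegion = new HashMap<>();

    public PerRegionFailureRateTracker(int maxFailuresPerInterval, long intervalMs) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMs = intervalMs;
    }

    /** Records a failure for the region and returns true if that region may still restart. */
    public boolean recordFailureAndCheck(String regionId, long nowMs) {
        Deque<Long> timestamps =
                failuresByRegion.computeIfAbsent(regionId, k -> new ArrayDeque<>());
        timestamps.addLast(nowMs);
        // Drop failures of this region that fall outside the interval.
        while (!timestamps.isEmpty() && nowMs - timestamps.peekFirst() > intervalMs) {
            timestamps.removeFirst();
        }
        return timestamps.size() <= maxFailuresPerInterval;
    }
}
```

With this shape, a failure in one region does not consume the failure budget
of another region.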
Hi @Paul Lam, thanks for the reply. I think it makes a lot of sense to
distinguish exceptions, but it might add complexity to the restart policy
maintenance, and some exceptions might be wrapped in the
FlinkRuntimeException or something else.
Maybe we can implement the first version based on the ti
In addition, there’s another viable alternative strategy that could be
applied with or without the proposed strategy.
We could group the exceptions that occurred within an interval by
exception class, so that each distinct exception class within an interval
is counted as only one failure.
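
A minimal sketch of this grouping idea, assuming a sliding interval and
timestamps passed in non-decreasing order. The class and method names are
hypothetical and not part of Flink. Note that it compares the top-level
exception class only, so exceptions wrapped in another type would all be
grouped under the wrapper.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not a Flink class): counts distinct exception classes per interval. */
public class DistinctExceptionFailureCounter {
    private static final class Entry {
        final long timestampMs;
        final Class<? extends Throwable> type;

        Entry(long timestampMs, Class<? extends Throwable> type) {
            this.timestampMs = timestampMs;
            this.type = type;
        }
    }

    private final long intervalMs;
    private final Deque<Entry> entries = new ArrayDeque<>();

    public DistinctExceptionFailureCounter(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    /** Records a failure; only the top-level class of the cause is kept (no unwrapping). */
    public void recordFailure(Throwable cause, long nowMs) {
        entries.addLast(new Entry(nowMs, cause.getClass()));
    }

    /** Returns the number of distinct exception classes seen within the interval ending at nowMs. */
    public int distinctFailures(long nowMs) {
        // Drop entries that fall outside the interval.
        while (!entries.isEmpty() && nowMs - entries.peekFirst().timestampMs > intervalMs) {
            entries.removeFirst();
        }
        Set<Class<? extends Throwable>> distinct = new HashSet<>();
        for (Entry e : entries) {
            distinct.add(e.type);
        }
        return distinct.size();
    }
}
```
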
The upside is that it’s more fi
Hi Weihua,
+1 for the new restart strategy you suggested.
We’re also using the failure-rate strategy as the cluster-wide default and
faced the same problem, which we solved with a similar approach.
FYI, we added a freeze period config option to the failure-rate strategy.
The freeze period would preven
Hi, everyone
I'd like to bring up a discussion about restart strategies. Flink supports three
kinds of restart strategies. These work very well for jobs with specific
configs, but for platform users who manage hundreds of jobs, there is no
common strategy to use.
Let me explain the reason. We manage a lo