Hello Flink,

I have some questions regarding the guidelines for configuring the restart strategy.
I was testing a job with the following setup:

1. The job has many tasks, but I am currently running it with a parallelism of only 2 and plenty of task slots (4 TMs with 4 task slots each).
2. It runs on k8s with HA enabled.
3. The current restart strategy is 'failure-rate' with a failure-rate interval of 30 minutes, a delay of 1 minute, and a maximum of 3 failures per interval.

When a TM was removed by k8s, it appeared to cause multiple failures all at once. In the JobManager log, I see different tasks failing with the same stack trace, 'Heartbeat of taskManager with id {SOME_ID} timed out', at around the same time.

I understand that all the tasks that were running on this TaskManager would fail, but I still have the following questions:

1. What counts as one failure for a restart strategy? Judging from my other jobs, it doesn't look like every failed task counts as one failure. Is that because the failover strategy defaults to region, so each failure restarts only part of the job graph, and the parts that were not restarted can still cause further failures that count towards the failure rate?
2. If that is the case, what is the recommended way to set the restart strategy? If we don't want to hard-code a number for every single pipeline we run, is there a good way to infer how to set the failure rate?

Thank you so much!
Jiahui
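
P.S. For reference, the restart settings described above would look roughly like this as flink-conf.yaml entries. This is only a sketch: it assumes the classic option names (newer Flink versions spell the first key 'restart-strategy.type'), and the failover-strategy line just makes explicit the default (region) mentioned in the question, it is not something I set myself.

```yaml
# Sketch of the setup from this email, not a copy of the actual config file.
restart-strategy: failure-rate
# At most 3 failures are tolerated per measuring interval.
restart-strategy.failure-rate.max-failures-per-interval: 3
# Interval over which failures are counted.
restart-strategy.failure-rate.failure-rate-interval: 30 min
# Delay between two consecutive restart attempts.
restart-strategy.failure-rate.delay: 1 min
# Assumed default: only the affected pipelined region is restarted.
jobmanager.execution.failover-strategy: region
```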