Re: TolerableCheckpointFailureNumber not always applying

2022-05-24 Thread Gaël Renoux
I get the idea, but in our case this was a transient error: it was a network issue, which was resolved later without any change to Flink (see the last line of the stack trace). Errors in the sync phase are not always non-transient (in our case, they pretty much never are). To be honest, I have trouble imagining ...

Re: TolerableCheckpointFailureNumber not always applying

2022-05-23 Thread Hangxiang Yu
In my opinion, some exceptions in the async phase, such as timeouts, may be related to the network or to the state size, which can change, so these failures may not occur next time. So the config makes sense for those. But a failure in the sync phase usually means the program will always fail, and ...

Re: TolerableCheckpointFailureNumber not always applying

2022-05-23 Thread Gaël Renoux
Got it, thank you. I misread the documentation and thought "async" referred to the task itself, not to the process of taking a checkpoint. I guess there is currently no way to make a job never fail on a failed checkpoint? Gaël Renoux - Lead R&D Engineer E - gael.ren...@datadome.co W - www.datadome

Re: TolerableCheckpointFailureNumber not always applying

2022-05-23 Thread Hangxiang Yu
Hi, Gaël Renoux. As you can see in [1], there is a description of the config: "This only applies to the following failure reasons: IOException on the Job Manager, failures in the async phase on the Task Managers and checkpoint expiration due to a timeout. Failures originating from the sync ..."
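
For context, the limit that documentation describes can also be set through the job or cluster configuration rather than the CheckpointConfig API. A minimal flink-conf.yaml sketch (the interval value is illustrative and not taken from the thread):

# flink-conf.yaml (illustrative values)
execution.checkpointing.interval: 60 s
execution.checkpointing.tolerable-failed-checkpoints: 1000000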

TolerableCheckpointFailureNumber not always applying

2022-05-23 Thread Gaël Renoux
Hello everyone, We're having an issue with our Flink job: it restarted because it failed a checkpoint, even though it shouldn't have. We've set the tolerableCheckpointFailureNumber to 1 million so that the job never restarts because of this. However, the job did restart following a checkpoint failure ...
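
For reference, setting this limit in the DataStream API goes through the CheckpointConfig. A minimal sketch, where the class name and the 60 s checkpoint interval are illustrative assumptions and not taken from the original message:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToleranceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds (interval chosen for illustration only).
        env.enableCheckpointing(60_000L);

        // Tolerate up to 1,000,000 consecutive checkpoint failures before failing the job,
        // as in the original message. Per the rest of the thread, this only covers
        // IOExceptions on the JobManager, async-phase failures on the TaskManagers and
        // checkpoint timeouts; sync-phase failures still force a failover of the task.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(1_000_000);

        // ... pipeline definition and env.execute(...) would follow here.
    }
}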