I get the idea, but in our case this was a transient error: it was a
network issue, which later resolved itself without any change to the Flink
job (see the last line of the stack trace). Errors in the sync phase are
not necessarily non-transient (in our case, they almost never are).
To be honest, I have trouble imagining […]
In my opinion, some exceptions in the async phase, such as timeouts, can
be caused by transient conditions like network issues or a state size
that will change, so the same failure may not occur next time. For those,
the config makes sense. But a failure in the sync phase usually means the
program will always fail, and […]
Got it, thank you. I misread the documentation and thought "async"
referred to the task itself, not to the process of taking a checkpoint.
I guess there is currently no way to make a job never fail on a failed
checkpoint?
Gaël Renoux - Lead R&D Engineer
E - gael.ren...@datadome.co
W - www.datadome
Hi Gaël Renoux,
As you can see in [1], the documentation describes the config as follows:
"This only applies to the following failure reasons: IOException on the Job
Manager, failures in the async phase on the Task Managers and checkpoint
expiration due to a timeout. Failures originating from the syn[…]"
Hello everyone,
We're having an issue with our Flink job: it restarted because it failed a
checkpoint, even though it shouldn't have. We've set
tolerableCheckpointFailureNumber to 1 million so that the job never
restarts for this reason. However, the job did restart following a
checkpoint failure.
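For context, here is a minimal sketch of how the failure-tolerance limit
discussed in this thread is typically set. The config key below is the one
documented for recent Flink versions; exact key and method names may vary
by version, so treat this as an illustration rather than a drop-in config:

```
# flink-conf.yaml (sketch, assuming a recent Flink version)
# Tolerate up to 1,000,000 failed checkpoints before failing the job.
# Per the documentation quoted above, this only covers IOExceptions on
# the JobManager, async-phase failures on the TaskManagers, and
# checkpoint timeouts; sync-phase failures still fail the job.
execution.checkpointing.tolerable-failed-checkpoints: 1000000
```

The programmatic equivalent in the DataStream API is
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(1000000),
which is what the thread refers to.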