Changelog fail leads to job fail regardless of tolerable-failed-checkpoints config

Dongwoo Kim Tue, 20 Jun 2023 03:31:06 -0700

Hello all, I have a question about changelog persist failures.
When the changelog persist process fails due to an S3 timeout, the job
fails even though "execution.checkpointing.tolerable-failed-checkpoints"
is set to 5. The log shows:

Caused by: java.io.IOException: The upload for 522 has already failed previously

Upon digging into the source code, I observed that Flink always checks the
sequence number against the latest failed sequence number and throws this
IOException when the upload has already failed. I am curious about the
reasoning behind this check, as it seems to prevent the
"tolerable-failed-checkpoints" configuration from working as expected.
Can anyone explain the goal behind this design?
Additionally, I'd like to propose a potential solution: what if we adjusted
this section to allow failed changelog uploads to be retried on subsequent
attempts, up to the "tolerable-failed-checkpoints" limit, before declaring
the job failed?
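
To make the difference concrete, here is a minimal Java sketch of the two
policies. It is a simplified model, not Flink's code: ChangelogRetrySketch,
Uploader, persistStrict and persistTolerant are hypothetical names, and the
tolerableFailures field is only assumed to play the role of the
"tolerable-failed-checkpoints" setting, counting consecutive failures.

```java
import java.io.IOException;

// Simplified, hypothetical model of a changelog writer. None of these names
// are Flink classes; they only illustrate the two policies discussed above.
class ChangelogRetrySketch {

    // Hypothetical stand-in for the remote upload (e.g. to S3).
    interface Uploader {
        void upload(long sqn) throws IOException;
    }

    private final Uploader uploader;
    private final int tolerableFailures; // assumed analogue of tolerable-failed-checkpoints
    private long latestFailedSqn = -1;   // -1 means no upload has failed yet
    private int consecutiveFailures = 0;

    ChangelogRetrySketch(Uploader uploader, int tolerableFailures) {
        this.uploader = uploader;
        this.tolerableFailures = tolerableFailures;
    }

    // Observed behaviour: once an upload has failed, any persist at or below
    // that sequence number is rejected without another upload attempt.
    void persistStrict(long fromSqn) throws IOException {
        if (latestFailedSqn >= fromSqn) {
            throw new IOException(
                    "The upload for " + latestFailedSqn + " has already failed previously");
        }
        attemptUpload(fromSqn);
    }

    // Proposed behaviour: keep attempting the upload until the number of
    // consecutive failures exceeds the tolerated limit.
    void persistTolerant(long fromSqn) throws IOException {
        if (consecutiveFailures > tolerableFailures) {
            throw new IOException(
                    "The upload for " + latestFailedSqn + " failed more than "
                            + tolerableFailures + " times");
        }
        attemptUpload(fromSqn);
    }

    // Each failed attempt still propagates its exception (modelling one
    // failed checkpoint); a success resets the consecutive-failure count.
    private void attemptUpload(long sqn) throws IOException {
        try {
            uploader.upload(sqn);
            consecutiveFailures = 0;
        } catch (IOException e) {
            latestFailedSqn = Math.max(latestFailedSqn, sqn);
            consecutiveFailures++;
            throw e;
        }
    }
}
```

With an uploader that times out once and then recovers, persistStrict(522)
keeps throwing on every later call, while persistTolerant(522) succeeds on
the second attempt. Whether retrying the same sequence number is actually
safe for the changelog is exactly the part I am unsure about.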


Thanks in advance

Best regards
dongwoo
