Hello all,

I have a question about changelog persist failures. When the changelog persist process fails due to an S3 timeout, the job fails even though "execution.checkpointing.tolerable-failed-checkpoints" is set to 5. The log shows:

Caused by: java.io.IOException: The upload for 522 has already failed previously

Digging into the source code, I found that Flink always checks the sequence number against the latest failed sequence number, and this check results in the IOException. I am curious about the reasoning behind it, because it seems to prevent the "tolerable-failed-checkpoints" setting from working as expected. Can anyone explain the goal behind this design?

I would also like to propose a potential solution: what if we adjusted this section so that failed changelogs can be uploaded again on subsequent attempts, up to the "tolerable-failed-checkpoints" limit, before the job is declared failed?
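To make the question concrete, here is a simplified sketch of the check as I understand it, together with the proposed change. It is not the actual Flink source: the class `ChangelogWriterSketch`, its fields, and the `tolerableFailures` parameter are hypothetical names, and the `>=` comparison is my assumption about how the sequence numbers are compared. With `tolerableFailures` set to 0 it models the behaviour I am seeing; a positive value models the proposal.

```java
import java.io.IOException;
import java.util.OptionalLong;

// Simplified model for discussion only; NOT the actual Flink implementation.
// All names here are hypothetical.
final class ChangelogWriterSketch {
    // Hypothetical retry budget: 0 models the behaviour observed today,
    // a positive value models the proposed change.
    private final int tolerableFailures;
    private int failedUploads = 0;
    private OptionalLong highestFailed = OptionalLong.empty();

    ChangelogWriterSketch(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    // Record that the upload covering sequence number sqn failed (e.g. S3 timeout).
    void markUploadFailed(long sqn) {
        failedUploads++;
        if (!highestFailed.isPresent() || sqn > highestFailed.getAsLong()) {
            highestFailed = OptionalLong.of(sqn);
        }
    }

    // Called when a checkpoint asks to persist changes starting at sequence number from.
    void persist(long from) throws IOException {
        // Assumption: a range counts as "already failed" when the highest
        // failed sequence number is at or beyond the requested start.
        boolean previouslyFailed =
                highestFailed.isPresent() && highestFailed.getAsLong() >= from;
        if (previouslyFailed && failedUploads > tolerableFailures) {
            throw new IOException(
                    "The upload for " + highestFailed.getAsLong()
                            + " has already failed previously");
        }
        // The (re-)upload of the changes would be scheduled here.
    }
}
```

In this model, a writer created with `new ChangelogWriterSketch(0)` rejects `persist(522)` right after `markUploadFailed(522)`, while one created with `new ChangelogWriterSketch(5)` keeps accepting it until a sixth failure has been recorded.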
Thanks in advance.

Best regards,
Dongwoo