Hi Junrui,

Currently, we have configured our Flink cluster with execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION and state.checkpoints.num-retained: 10. However, this setup still begins deleting the oldest checkpoint once we exceed 10. Typically, by the time substantial traffic spikes occur, we already have the maximum of 10 checkpoints, so deletions still happen during the spikes, which limits the usefulness of this setting for our use case.

Best,
Yang

On Tue, 7 Nov 2023 at 20:03, Junrui Lee <jrlee....@gmail.com> wrote:

> Hi Yang,
>
> You can try configuring
> "execution.checkpointing.externalized-checkpoint-retention:
> RETAIN_ON_CANCELLATION"[1] and increasing the value of
> "state.checkpoints.num-retained"[2] to retain more checkpoints.
>
> Here are the official documentation links for more details:
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#execution-checkpointing-externalized-checkpoint-retention
>
> [2]
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#state-checkpoints-num-retained
>
> Best,
> Junrui
>
> On Tue, 7 Nov 2023 at 22:02, Yang LI <yang.hunter...@gmail.com> wrote:
>
>> Dear Flink Community,
>>
>> In our Flink application, we persist checkpoints to AWS S3. Recently,
>> during periods of high job parallelism and traffic, we've experienced
>> checkpoint failures. Upon investigating, it appears these may be related to
>> S3 delete object requests interrupting checkpoint re-uploads, as evidenced
>> by numerous InterruptedExceptions.
>>
>> We aim to explore options for disabling the deletion of stale
>> checkpoints. Despite consulting the Flink configuration documentation and
>> conducting various tests, the appropriate setting to prevent old checkpoint
>> cleanup remains elusive.
>>
>> Could you advise if there's a method to disable the automatic cleanup of
>> old Flink checkpoints?
>>
>> Best,
>> Yang