Hi Yang,

I think there is no configuration option that allows users to disable checkpoint file cleanup at runtime.
Does your Flink application use incremental checkpoints?

1) If yes, I think leveraging S3's lifecycle management to clean up checkpoint files is not safe. An incremental checkpoint can still reference files written by earlier checkpoints, so an age-based rule may accidentally delete a file that is still in use, although the probability is small.
2) If no, you can try enabling incremental checkpoints and increasing the checkpoint interval to reduce the S3 traffic.

Yang LI <yang.hunter...@gmail.com> wrote on Wed, Nov 8, 2023 at 04:58:

> Hi Martijn,
>
> We're currently utilizing flink-s3-fs-presto. After reviewing the
> flink-s3-fs-hadoop source code, I believe we would encounter similar issues
> with it as well.
>
> When we say, 'The purpose of a checkpoint, in principle, is that Flink
> manages its lifecycle,' I think it implies that the automatic cleanup of
> old checkpoints is an integral part of Flink's lifecycle management.
> However, is there a configuration option available that allows us to
> disable this automatic cleanup? We're considering leveraging AWS S3's
> lifecycle management capabilities to handle this aspect instead of relying
> on Flink.
>
> Best,
> Yang
>
> On Tue, 7 Nov 2023 at 18:44, Martijn Visser <martijnvis...@apache.org>
> wrote:
>
>> Ah, I actually misread checkpoint and savepoints, sorry. The purpose
>> of a checkpoint in principle is that Flink manages its lifecycle.
>> Which S3 interface are you using for the checkpoint storage?
>>
>> On Tue, Nov 7, 2023 at 6:39 PM Martijn Visser <martijnvis...@apache.org>
>> wrote:
>> >
>> > Hi Yang,
>> >
>> > If you use the NO_CLAIM mode, Flink will not assume ownership of a
>> > snapshot and leave it up to the user to delete them. See the blog [1]
>> > for more details.
>> >
>> > Best regards,
>> >
>> > Martijn
>> >
>> > [1] https://flink.apache.org/2022/05/06/improvements-to-flink-operations-snapshots-ownership-and-savepoint-formats/#no_claim-default-mode
>> >
>> > On Tue, Nov 7, 2023 at 5:29 PM Junrui Lee <jrlee....@gmail.com> wrote:
>> > >
>> > > Hi Yang,
>> > >
>> > > You can try configuring
>> > > "execution.checkpointing.externalized-checkpoint-retention:
>> > > RETAIN_ON_CANCELLATION"[1] and increasing the value of
>> > > "state.checkpoints.num-retained"[2] to retain more checkpoints.
>> > >
>> > > Here are the official documentation links for more details:
>> > >
>> > > [1] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#execution-checkpointing-externalized-checkpoint-retention
>> > >
>> > > [2] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#state-checkpoints-num-retained
>> > >
>> > > Best,
>> > >
>> > > Junrui
>> > >
>> > > Yang LI <yang.hunter...@gmail.com> wrote on Tue, Nov 7, 2023 at 22:02:
>> > >>
>> > >> Dear Flink Community,
>> > >>
>> > >> In our Flink application, we persist checkpoints to AWS S3.
>> > >> Recently, during periods of high job parallelism and traffic, we've
>> > >> experienced checkpoint failures. Upon investigating, it appears these
>> > >> may be related to S3 delete object requests interrupting checkpoint
>> > >> re-uploads, as evidenced by numerous InterruptedExceptions.
>> > >>
>> > >> We aim to explore options for disabling the deletion of stale
>> > >> checkpoints. Despite consulting the Flink configuration documentation
>> > >> and conducting various tests, the appropriate setting to prevent old
>> > >> checkpoint cleanup remains elusive.
>> > >>
>> > >> Could you advise if there's a method to disable the automatic
>> > >> cleanup of old Flink checkpoints?
>> > >>
>> > >> Best,
>> > >> Yang
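
For reference, the settings discussed in this thread could be collected in flink-conf.yaml roughly as follows. This is only a sketch for Flink 1.18: the two retention keys are the ones quoted in the thread, while state.backend.incremental and execution.checkpointing.interval are my assumption about which keys correspond to "enable incremental checkpoint" and "increase the checkpoint interval". All values are illustrative placeholders, not recommendations. None of these keys disables checkpoint cleanup; they only reduce how often files are written and deleted.

```yaml
# Sketch only; values are illustrative assumptions, tune them for your job.

# Upload only the changes since the previous checkpoint (assumed key for
# Flink 1.18; takes effect only with a state backend that supports it,
# such as RocksDB).
state.backend.incremental: true

# Checkpoint less often to reduce S3 traffic (placeholder value).
execution.checkpointing.interval: 10min

# Keep externalized checkpoints when the job is cancelled.
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION

# Retain more completed checkpoints before old ones are cleaned up
# (placeholder value).
state.checkpoints.num-retained: 5
```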