Hi Jinzhong,

Sorry for the late reply. We previously switched from incremental to non-incremental checkpoints; I think one of the reasons was the difficulty of handling checkpoint cleanup on S3 properly. With the Flink operator's periodic savepoints, that may change. I'll re-test it, thanks for the help!

Best,
Yang

On Wed, 8 Nov 2023 at 06:51, Jinzhong Li <lijinzhong2...@gmail.com> wrote:

> Hi Yang,
>
> I think there is no configuration option available that allows users to
> disable checkpoint file cleanup at runtime.
>
> Does your Flink application use incremental checkpoints?
> 1) If yes, I think leveraging S3's lifecycle management to clean up
> checkpoint files is not safe, because it may accidentally delete a file
> that is still in use, although the probability is small.
> 2) If no, you can try enabling incremental checkpoints and increasing the
> checkpoint interval to reduce the S3 traffic.
>
> Yang LI <yang.hunter...@gmail.com> wrote on Wed, 8 Nov 2023 at 04:58:
>
>> Hi Martijn,
>>
>> We're currently utilizing flink-s3-fs-presto. After reviewing the
>> flink-s3-fs-hadoop source code, I believe we would encounter similar issues
>> with it as well.
>>
>> When we say, 'The purpose of a checkpoint, in principle, is that Flink
>> manages its lifecycle,' I think it implies that the automatic cleanup of
>> old checkpoints is an integral part of Flink's lifecycle management.
>> However, is there a configuration option available that allows us to
>> disable this automatic cleanup? We're considering leveraging AWS S3's
>> lifecycle management capabilities to handle this aspect instead of relying
>> on Flink.
>>
>> Best,
>> Yang
>>
>> On Tue, 7 Nov 2023 at 18:44, Martijn Visser <martijnvis...@apache.org>
>> wrote:
>>
>>> Ah, I actually mixed up checkpoints and savepoints, sorry. The purpose
>>> of a checkpoint in principle is that Flink manages its lifecycle.
>>> Which S3 interface are you using for the checkpoint storage?
>>>
>>> On Tue, Nov 7, 2023 at 6:39 PM Martijn Visser <martijnvis...@apache.org>
>>> wrote:
>>> >
>>> > Hi Yang,
>>> >
>>> > If you use the NO_CLAIM mode, Flink will not assume ownership of a
>>> > snapshot and leaves it up to the user to delete it. See the blog [1]
>>> > for more details.
>>> >
>>> > Best regards,
>>> >
>>> > Martijn
>>> >
>>> > [1]
>>> > https://flink.apache.org/2022/05/06/improvements-to-flink-operations-snapshots-ownership-and-savepoint-formats/#no_claim-default-mode
>>> >
>>> > On Tue, Nov 7, 2023 at 5:29 PM Junrui Lee <jrlee....@gmail.com> wrote:
>>> > >
>>> > > Hi Yang,
>>> > >
>>> > > You can try configuring
>>> > > "execution.checkpointing.externalized-checkpoint-retention:
>>> > > RETAIN_ON_CANCELLATION"[1] and increasing the value of
>>> > > "state.checkpoints.num-retained"[2] to retain more checkpoints.
>>> > >
>>> > > Here are the official documentation links for more details:
>>> > >
>>> > > [1]
>>> > > https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#execution-checkpointing-externalized-checkpoint-retention
>>> > >
>>> > > [2]
>>> > > https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#state-checkpoints-num-retained
>>> > >
>>> > > Best,
>>> > >
>>> > > Junrui
>>> > >
>>> > > Yang LI <yang.hunter...@gmail.com> wrote on Tue, 7 Nov 2023 at 22:02:
>>> > >>
>>> > >> Dear Flink Community,
>>> > >>
>>> > >> In our Flink application, we persist checkpoints to AWS S3.
>>> > >> Recently, during periods of high job parallelism and traffic, we've
>>> > >> experienced checkpoint failures. Upon investigating, it appears these
>>> > >> may be related to S3 delete object requests interrupting checkpoint
>>> > >> re-uploads, as evidenced by numerous InterruptedExceptions.
>>> > >>
>>> > >> We aim to explore options for disabling the deletion of stale
>>> > >> checkpoints. Despite consulting the Flink configuration documentation
>>> > >> and conducting various tests, the appropriate setting to prevent old
>>> > >> checkpoint cleanup remains elusive.
>>> > >>
>>> > >> Could you advise if there's a method to disable the automatic
>>> > >> cleanup of old Flink checkpoints?
>>> > >>
>>> > >> Best,
>>> > >> Yang
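
For reference, the settings suggested in this thread could be combined roughly as in the sketch below. The keys are the Flink 1.18 configuration options named by Junrui and implied by Jinzhong's second suggestion; every value, and the S3 path in particular, is an illustrative assumption rather than something recommended in the thread. As Jinzhong notes, none of these options disables checkpoint cleanup; they only change how many checkpoints are kept and how much S3 traffic checkpointing generates.

```yaml
# Sketch only: values are assumptions for illustration, not tuned recommendations.

# Keep externalized checkpoints when the job is cancelled (Junrui's suggestion).
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION

# Retain more completed checkpoints before older ones are cleaned up.
state.checkpoints.num-retained: 5

# Incremental checkpoints (for state backends that support them, such as
# RocksDB) plus a longer interval, to reduce S3 traffic (Jinzhong's suggestion).
state.backend.incremental: true
execution.checkpointing.interval: 10min

# Hypothetical checkpoint location; replace the bucket and path with your own.
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
```

With incremental checkpoints enabled, files from older checkpoints can still be referenced by newer ones, which is why time-based deletion through S3 lifecycle rules is described above as unsafe.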