Hi,

We noticed it first when running out of space on the S3 PV. We worked out
that the subdirectory is the job ID, so I don't know how that is meant to
be cleaned up when the job is recreated with a new job ID. It might be
that we need to work something out outside of Flink.
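To illustrate, the layout we are looking at is along these lines (the job
ID and checkpoint number are placeholders, and the shared/taskowned names
are from memory, so treat them as illustrative):

  s3://flink-application/checkpoints/<job-id>/chk-1234/
  s3://flink-application/checkpoints/<job-id>/shared/
  s3://flink-application/checkpoints/<job-id>/taskowned/

with a new <job-id> directory appearing each time the job is recreated,
and only the latest one belonging to the running job.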
But we also noticed a huge number of subdirectories below that (under the
current job ID), and I am not sure why they are not cleaned up. I do not
remember anything about a state TTL, so it is probably not set. Is that
something set in code, or a property? If so, which one? (I have put a
rough sketch of what I think the TTL API looks like below the quoted
thread.)

Thanks

JM

On Thu, 2 Jan 2025, 06:38 Zakelly Lan, <zakelly....@gmail.com> wrote:

> FW to user ML.
>
> Hi Jean-Marc,
>
> Could you elaborate more about how you noticed an increasing number of
> checkpoints are left behind? Is the number of subdirectories under
> s3://flink-application/checkpoints increasing? And have you set the state
> TTL?
>
> Best,
> Zakelly
>
> On Thu, Jan 2, 2025 at 12:19 PM Zakelly Lan <zakelly....@gmail.com> wrote:
>
>> Hi Jean-Marc,
>>
>> Could you elaborate more about how you noticed an increasing number of
>> checkpoints are left behind? Is the number of subdirectories under
>> s3://flink-application/checkpoints increasing? And have you set the state
>> TTL?
>>
>> Best,
>> Zakelly
>>
>> On Tue, Dec 31, 2024 at 7:58 PM Jean-Marc Paulin <jm.pau...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We are on Flink 1.20/Java 17 running in a k8s environment, with
>>> checkpoints enabled on S3 and the following checkpoint options:
>>>
>>>   execution.checkpointing.dir: s3://flink-application/checkpoints
>>>   execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION
>>>   execution.checkpointing.interval: 150000 ms
>>>   execution.checkpointing.min-pause: 30000 ms
>>>   execution.checkpointing.mode: EXACTLY_ONCE
>>>   execution.checkpointing.savepoint-dir: s3://flink-application/savepoints
>>>   execution.checkpointing.timeout: 10 min
>>>   execution.checkpointing.tolerable-failed-checkpoints: "3"
>>>
>>> We have been through quite a few Flink application restarts due to
>>> streaming failures for various reasons (mostly Kafka related), but also
>>> Flink application changes. The Flink application then tends to be
>>> resumed from savepoints, but we noticed that an increasing number of
>>> checkpoints are left behind. Is there a built-in way of cleaning up
>>> these obsolete checkpoints?
>>>
>>> I suppose what we do not really understand is the condition(s) under
>>> which Flink may not clean up checkpoints. Can someone explain?
>>>
>>> Thanks
>>>
>>> JM
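P.S. On the state TTL question: my current understanding (happy to be
corrected) is that TTL is not a cluster-level property but something you
enable per state descriptor in the job code, roughly along the lines of
the sketch below. We do not do anything like this anywhere in our job,
which is why I assume it is not set. The class name, state name and TTL
value here are just placeholders:

  import java.time.Duration;

  import org.apache.flink.api.common.state.StateTtlConfig;
  import org.apache.flink.api.common.state.ValueStateDescriptor;

  public class TtlSketch {
      public static void main(String[] args) {
          // Expire entries one day after they are created or last updated,
          // and never return values that have already expired.
          StateTtlConfig ttlConfig = StateTtlConfig
                  .newBuilder(Duration.ofDays(1))
                  .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                  .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                  .build();

          // TTL is attached to an individual state descriptor, which would
          // then be used inside an operator's open() method as usual.
          ValueStateDescriptor<String> descriptor =
                  new ValueStateDescriptor<>("my-state", String.class);
          descriptor.enableTimeToLive(ttlConfig);
      }
  }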