FW to user ML.

Hi Jean-Marc,

Could you elaborate on how you noticed that an increasing number of checkpoints
are left behind? Is the number of subdirectories under
s3://flink-application/checkpoints increasing? And have you set a state TTL?

Best,
Zakelly

On Thu, Jan 2, 2025 at 12:19 PM Zakelly Lan <zakelly....@gmail.com> wrote:

> Hi Jean-Marc,
>
> Could you elaborate on how you noticed that an increasing number of
> checkpoints are left behind? Is the number of subdirectories under
> s3://flink-application/checkpoints increasing? And have you set a state TTL?
>
> Best,
> Zakelly
>
> On Tue, Dec 31, 2024 at 7:58 PM Jean-Marc Paulin <jm.pau...@gmail.com> wrote:
>
>> Hi,
>>
>> We are on Flink 1.20 / Java 17 running in a Kubernetes environment, with
>> checkpoints enabled on S3 and the following checkpoint options:
>>
>>   execution.checkpointing.dir: s3://flink-application/checkpoints
>>   execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION
>>   execution.checkpointing.interval: 150000 ms
>>   execution.checkpointing.min-pause: 30000 ms
>>   execution.checkpointing.mode: EXACTLY_ONCE
>>   execution.checkpointing.savepoint-dir: s3://flink-application/savepoints
>>   execution.checkpointing.timeout: 10 min
>>   execution.checkpointing.tolerable-failed-checkpoints: "3"
>>
>> We have been through quite a few Flink application restarts, caused by
>> streaming failures for various reasons (mostly Kafka related) but also by
>> Flink application changes. The application is then usually resumed from a
>> savepoint, but we have noticed that an increasing number of checkpoints are
>> left behind. Is there a built-in way of cleaning up these obsolete
>> checkpoints?
>>
>> I suppose what we do not really understand is the condition(s) under which
>> Flink may not clean up checkpoints. Can someone explain?
>>
>> Thanks
>>
>> JM
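
Two configuration knobs relate directly to the retention question above. A
minimal illustrative flink-conf sketch; the values shown are the documented
default for num-retained and the retention mode already quoted in this thread,
not a recommendation for this particular deployment:

  # Number of completed checkpoints kept per running job (default: 1);
  # while the job is running, older completed checkpoints are removed by
  # the checkpoint coordinator.
  state.checkpoints.num-retained: 1

  # DELETE_ON_CANCELLATION only removes the retained checkpoint when the job
  # is cancelled cleanly. If the job fails, or its JobManager/TaskManager pods
  # are killed, the last checkpoint is deliberately kept on S3 so it remains
  # available for recovery.
  execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION

In other words, checkpoints from runs that ended in failure (rather than a
clean cancel) are expected to stay behind under this setting.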
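On the state TTL question: for DataStream keyed state, TTL is enabled per state
descriptor in application code rather than through the flink-conf options
listed above. A minimal sketch of the long-standing StateTtlConfig API; the
descriptor name "lastSeenValue" and the 24-hour TTL are illustrative
placeholders, not values taken from this thread:

  import org.apache.flink.api.common.state.StateTtlConfig;
  import org.apache.flink.api.common.state.ValueStateDescriptor;
  import org.apache.flink.api.common.time.Time;

  // Expire entries 24 hours after they were created or last updated,
  // and never return entries whose TTL has already elapsed.
  StateTtlConfig ttlConfig = StateTtlConfig
      .newBuilder(Time.hours(24))
      .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
      .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
      .cleanupFullSnapshot()   // also drop expired entries when a full snapshot is taken
      .build();

  ValueStateDescriptor<String> descriptor =
      new ValueStateDescriptor<>("lastSeenValue", String.class);
  descriptor.enableTimeToLive(ttlConfig);

Note that state TTL bounds the size of the state itself; it does not by itself
delete old checkpoint directories on S3.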