Hi,

We noticed it first when we ran out of space on the S3 PV. We worked out
that the subdirectory name is the job ID, so I don't know how that is meant
to be cleaned up when the job is recreated with a new job ID. It may be that
we need to work something out outside of Flink.
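
If we do have to handle it outside Flink, something along these lines is
roughly what I have in mind: just a sketch using the AWS SDK for Java v2,
assuming the checkpoints/<jobId>/... layout in that bucket and that we know
the currently active job ID (the class name and argument handling here are
hypothetical):

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.DeleteObjectRequest;
    import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
    import software.amazon.awssdk.services.s3.model.ListObjectsV2Response;
    import software.amazon.awssdk.services.s3.model.S3Object;

    public class StaleCheckpointCleaner {
        public static void main(String[] args) {
            String bucket = "flink-application";
            String activeJobId = args[0];  // job ID of the currently running job

            ListObjectsV2Request listReq = ListObjectsV2Request.builder()
                .bucket(bucket)
                .prefix("checkpoints/")
                .build();

            try (S3Client s3 = S3Client.create()) {
                for (ListObjectsV2Response page : s3.listObjectsV2Paginator(listReq)) {
                    for (S3Object obj : page.contents()) {
                        // Keys look like checkpoints/<jobId>/chk-<n>/...; delete
                        // anything that does not belong to the active job.
                        if (!obj.key().startsWith("checkpoints/" + activeJobId + "/")) {
                            s3.deleteObject(DeleteObjectRequest.builder()
                                .bucket(bucket)
                                .key(obj.key())
                                .build());
                        }
                    }
                }
            }
        }
    }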

But we also noticed a huge number of subdirectories below that (under the
current job ID), and I am not sure why they are not cleaned up.

I do not remember anything about a state TTL, so it is probably not set. Is
that configured in code, or via a property? If so, which one?
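
For context, my understanding (which may well be wrong) is that state TTL is
enabled per state descriptor in code rather than through a single property; a
minimal sketch of what I would expect that to look like, with a hypothetical
state name, is below. We do not have anything like this in our job:

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.configuration.Configuration;

    public class TtlExample extends RichMapFunction<String, String> {
        private transient ValueState<String> state;

        @Override
        public void open(Configuration parameters) {
            // Entries expire 7 days after the last write (value is illustrative).
            StateTtlConfig ttl = StateTtlConfig
                .newBuilder(Time.days(7))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

            ValueStateDescriptor<String> desc =
                new ValueStateDescriptor<>("someState", String.class);
            desc.enableTimeToLive(ttl);  // without this, entries are never expired
            state = getRuntimeContext().getState(desc);
        }

        @Override
        public String map(String value) throws Exception {
            state.update(value);
            return value;
        }
    }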

Thanks

JM

On Thu, 2 Jan 2025, 06:38 Zakelly Lan, <zakelly....@gmail.com> wrote:

> FW to user ML.
>
> Hi Jean-Marc,
>
> Could you elaborate more about how you noticed an increasing number of
> checkpoints are left behind? Is the number of subdirectories under
> s3://flink-application/checkpoints increasing? And have you set the state
> TTL?
>
>
> Best,
> Zakelly
>
> On Thu, Jan 2, 2025 at 12:19 PM Zakelly Lan <zakelly....@gmail.com> wrote:
>
>> Hi Jean-Marc,
>>
>> Could you elaborate more about how you noticed an increasing number of
>> checkpoints are left behind? Is the number of subdirectories under
>> s3://flink-application/checkpoints increasing? And have you set the state
>> TTL?
>>
>>
>> Best,
>> Zakelly
>>
>> On Tue, Dec 31, 2024 at 7:58 PM Jean-Marc Paulin <jm.pau...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We are on Flink 1.20/Java17 running in a k8s environment, with
>>> checkpoints enabled on S3 and the following checkpoint options:
>>>
>>>     execution.checkpointing.dir: s3://flink-application/checkpoints
>>>     execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION
>>>     execution.checkpointing.interval: 150000 ms
>>>     execution.checkpointing.min-pause: 30000 ms
>>>     execution.checkpointing.mode: EXACTLY_ONCE
>>>     execution.checkpointing.savepoint-dir: s3://flink-application/savepoints
>>>     execution.checkpointing.timeout: 10 min
>>>     execution.checkpointing.tolerable-failed-checkpoints: "3"
>>>
>>> We have been through quite a few Flink application restarts due to
>>> streaming failures for various reasons (mostly Kafka related), but also
>>> Flink application changes. The Flink application then tends to be resumed
>>> from savepoints, but we noticed an increasing number of checkpoints are
>>> left behind. Is there a built-in way of cleaning these obsolete checkpoints?
>>>
>>> I suppose what we do not really understand is the condition(s) under
>>> which Flink may not clean up checkpoints. Can someone explain?
>>>
>>> Thanks
>>>
>>> JM
>>>
>>
