Hi,

> We noticed it first when running out of space on the S3 PV. We worked out
> that the subdirectory is the job Id, so I don't know how that is meant to
> be cleaned up when the job is recreated with a new job Id. It might be we
> need to work something out outside of Flink.
>

A new job id subdirectory is created when you restart the job manually. In a
failover scenario, the previous job id directory is reused.
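
For reference, the checkpoint layout under the configured directory usually
looks like this (the names below are the defaults and the job id is a
placeholder):

    s3://flink-application/checkpoints/<job-id>/chk-<n>/     one per retained checkpoint
    s3://flink-application/checkpoints/<job-id>/shared/      shared state of incremental checkpoints
    s3://flink-application/checkpoints/<job-id>/taskowned/   state owned by the task managers

So growth can come either from old <job-id> directories left over after manual
restarts, or from chk-<n> directories that are not removed.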

> But we also noticed a huge number of subdirectories below that (under the
> current job Id), and I am not sure why they are not cleaned up.
>

That's odd. Are these subdirectories named like 'chk-x'? Normally they are
deleted by Flink as newer checkpoints complete.

> I do not remember anything about a state TTL, so it's probably not set. Is
> that set in code? A property? If so, which one?
>

Please read
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/state/#state-time-to-live-ttl
for the DataStream API. If it is a SQL job, please set 'table.exec.state.ttl'.
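
For the DataStream API it is enabled per state descriptor, roughly like the
snippet in the linked docs (the one-day TTL and the state name below are just
placeholders, not a recommendation):

    import java.time.Duration;
    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.state.ValueStateDescriptor;

    // Expire entries that have not been written for one day (placeholder value).
    StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Duration.ofDays(1))
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .build();

    ValueStateDescriptor<String> descriptor =
        new ValueStateDescriptor<>("my-state", String.class);  // hypothetical state name
    descriptor.enableTimeToLive(ttlConfig);

For a SQL job, 'table.exec.state.ttl' can go next to your other options, e.g.

    table.exec.state.ttl: 1 d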

Besides that, I ran a test locally: the job recovers from the latest
checkpoint rather than a savepoint (even if the savepoint is the most recent
snapshot), and all the checkpoints are properly cleaned up. Do you trigger
savepoints periodically? And do you clean those up manually?
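
If the savepoints are triggered and cleaned up by hand, the CLI can do both;
for reference (the job id and savepoint path below are placeholders):

    # trigger a savepoint for a running job
    ./bin/flink savepoint <jobId> s3://flink-application/savepoints

    # dispose of a savepoint that is no longer needed
    ./bin/flink savepoint --dispose s3://flink-application/savepoints/savepoint-xxxx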


Best,
Zakelly

On Thu, Jan 2, 2025 at 3:50 PM Jean-Marc Paulin <j...@paulin.co.uk> wrote:

> Hi,
>
> We noticed it first when running out of space on the S3 PV. We worked out
> that the subdirectory is the job Id, so I don't know how that is meant to
> be cleaned up when the job is recreated with a new job Id. It might be we
> need to work something out outside of Flink.
>
> But we also noticed a huge number of subdirectories below that (under the
> current job Id), and I am not sure why they are not cleaned up.
>
> I do not remember anything about a state TTL, so it's probably not set. Is
> that set in code? A property? If so, which one?
>
> Thanks
>
> JM
>
> On Thu, 2 Jan 2025, 06:38 Zakelly Lan, <zakelly....@gmail.com> wrote:
>
>> FW to user ML.
>>
>> Hi Jean-Marc,
>>
>> Could you elaborate more about how you noticed an increasing number of
>> checkpoints are left behind? Is the number of subdirectories under
>> s3://flink-application/checkpoints increasing? And have you set the state
>> TTL?
>>
>>
>> Best,
>> Zakelly
>>
>> On Thu, Jan 2, 2025 at 12:19 PM Zakelly Lan <zakelly....@gmail.com>
>> wrote:
>>
>>> Hi Jean-Marc,
>>>
>>> Could you elaborate more about how you noticed an increasing number of
>>> checkpoints are left behind? Is the number of subdirectories under
>>> s3://flink-application/checkpoints increasing? And have you set the state
>>> TTL?
>>>
>>>
>>> Best,
>>> Zakelly
>>>
>>> On Tue, Dec 31, 2024 at 7:58 PM Jean-Marc Paulin <jm.pau...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We are on Flink 1.20/Java17 running in a k8s environment, with
>>>> checkpoints enabled on S3 and the following checkpoint options:
>>>>
>>>>     execution.checkpointing.dir: s3://flink-application/checkpoints
>>>>     execution.checkpointing.externalized-checkpoint-retention:
>>>> DELETE_ON_CANCELLATION
>>>>     execution.checkpointing.interval: 150000 ms
>>>>     execution.checkpointing.min-pause: 30000 ms
>>>>     execution.checkpointing.mode: EXACTLY_ONCE
>>>>     execution.checkpointing.savepoint-dir:
>>>> s3://flink-application/savepoints
>>>>     execution.checkpointing.timeout: 10 min
>>>>     execution.checkpointing.tolerable-failed-checkpoints: "3"
>>>>
>>>> We have been through quite a few Flink application restarts due to
>>>> streaming failures for various reasons (mostly Kafka related), but also
>>>> Flink application changes. The Flink application then tends to be resumed
>>>> from savepoints, but we noticed an increasing number of checkpoints are
>>>> left behind. Is there a built-in way of cleaning these obsolete 
>>>> checkpoints?
>>>>
>>>> I suppose what we do not really understand is the condition(s) under
>>>> which Flink may not clean up checkpoints. Can someone explain?
>>>>
>>>> Thanks
>>>>
>>>> JM
>>>>
>>>
