Hi Matthias,

In that case, the explanation is likely that the job has never reached a terminal state. I was testing upgrades *without* savepoints (but with HA enabled), so I guess automatic cleanup is never triggered.

Since, with this configuration, the job will theoretically never reach a terminal state, would it cause any issues if I clean the artifacts manually? I sketch below the kind of cleanup I have in mind.

And for completeness: I also see an artifact called completedCheckpointXYZ which is updated over time, and I imagine that one should never be removed.
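Concretely, I mean something along these lines. This is only a minimal sketch using the azure-storage-blob Python SDK; the connection string, container name, path prefix, and age threshold are all placeholders rather than my actual setup:

    from datetime import datetime, timedelta, timezone

    from azure.storage.blob import ContainerClient

    # All values below are placeholders, not my real configuration.
    CONNECTION_STRING = "..."
    CONTAINER = "flink-ha"
    PREFIX = "job_name/"
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)

    container = ContainerClient.from_connection_string(CONNECTION_STRING, CONTAINER)
    for blob in container.list_blobs(name_starts_with=PREFIX):
        basename = blob.name.rsplit("/", 1)[-1]
        # Keep the checkpoint metadata that the JM keeps updating.
        if basename.startswith("completedCheckpoint"):
            continue
        # Only remove artifacts whose last modification is older than the cutoff.
        if blob.last_modified < cutoff:
            container.delete_blob(blob.name)

The idea is to skip anything that is still being touched (like the completedCheckpoint files) and only delete blobs whose last-modified timestamp is much older than the checkpointing interval.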
Regards,
Alexis.

On Wed, Dec 7, 2022 at 13:03, Matthias Pohl <matthias.p...@aiven.io> wrote:

> Flink should already take care of cleaning up the artifacts you mentioned.
> Flink 1.15+ even includes retries if something goes wrong. There are still
> a few bugs that need to be fixed (e.g. FLINK-27355 [1]), and checkpoint HA
> data is not properly cleaned up yet, which is covered by FLIP-270 [2].
>
> It would be interesting to know why these artifacts haven't been deleted,
> assuming the corresponding job is actually in a final state (e.g. FAILED,
> CANCELLED, FINISHED), i.e. there is a JobResultStoreEntry file for that
> specific job in the folder Gyula was referring to in the linked
> documentation. At least for the JobGraph files, it's likely that you still
> have additional metadata in your HA backend that refers to the files. That
> is something you might want to remove as well if you want a proper
> cleanup. But still, it would be good to understand why these files are not
> cleaned up by Flink.
>
> Best,
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-27355
> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
>
> On Tue, Dec 6, 2022 at 5:42 PM Alexis Sarda-Espinosa <sarda.espin...@gmail.com> wrote:
>
>> One concrete question: under the HA folder I also see these sample
>> entries:
>>
>> - job_name/blob/job_uuid/blob_...
>> - job_name/submittedJobGraphX
>> - job_name/submittedJobGraphY
>>
>> Is it safe to clean these up while the job is in a healthy state?
>>
>> Regards,
>> Alexis.
>>
>> On Mon, Dec 5, 2022 at 20:09, Alexis Sarda-Espinosa <sarda.espin...@gmail.com> wrote:
>>
>>> Hi Gyula,
>>>
>>> That certainly helps, but to set up automatic cleanup (in my case, of
>>> Azure Blob Storage), the ideal option would be a simple lifecycle
>>> policy that deletes blobs that haven't been updated in some time. That
>>> would assume that anything relevant for the latest state is "touched"
>>> by the JM on every checkpoint, and since I also see blobs referencing
>>> "submitted job graphs", I imagine that might not be a safe assumption.
>>>
>>> I understand the life cycle of those blobs isn't directly managed by
>>> the operator, but in that regard it could make things more cumbersome.
>>>
>>> Ideally, Flink itself would guarantee this sort of allowable TTL for
>>> HA files, but I'm sure that's not trivial.
>>>
>>> Regards,
>>> Alexis.
>>>
>>> On Mon, 5 Dec 2022, 19:19 Gyula Fóra, <gyula.f...@gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> There are some files in the HA dir that are not cleaned up over time
>>>> and need to be cleaned up by the user:
>>>>
>>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/#jobresultstore-resource-leak
>>>>
>>>> Hope this helps,
>>>> Gyula
>>>>
>>>> On Mon, 5 Dec 2022 at 11:56, Alexis Sarda-Espinosa <sarda.espin...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I see that the number of entries in the directory configured for HA
>>>>> increases over time, particularly in the context of job upgrades in
>>>>> a Kubernetes environment managed by the operator.
>>>>> Would it be safe to assume that any files which haven't been updated
>>>>> in a while can be deleted, assuming the checkpointing interval is
>>>>> much smaller than the period used to determine whether files are too
>>>>> old?
>>>>>
>>>>> Regards,
>>>>> Alexis.