Hi Matthias,

In that case, the explanation is likely that the job has never reached a terminal state. I was testing upgrades *without* savepoints (but with HA enabled), so I guess automatic cleanup is never triggered.

Since, with this configuration, the job will theoretically never reach a terminal state, would it cause any issues if I clean the artifacts manually? I sketch below the kind of cleanup I have in mind.

And for completeness: I also see an artifact called completedCheckpointXYZ which is updated over time, and I imagine that one should never be removed.
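Concretely, I mean something along these lines. This is only a minimal sketch using the azure-storage-blob Python SDK; the connection string, container name, path prefix, and age threshold are all placeholders rather than my actual setup:

    from datetime import datetime, timedelta, timezone

    from azure.storage.blob import ContainerClient

    # All values below are placeholders, not my real configuration.
    CONNECTION_STRING = "..."
    CONTAINER = "flink-ha"
    PREFIX = "job_name/"
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)

    container = ContainerClient.from_connection_string(CONNECTION_STRING, CONTAINER)
    for blob in container.list_blobs(name_starts_with=PREFIX):
        basename = blob.name.rsplit("/", 1)[-1]
        # Keep the checkpoint metadata that the JM keeps updating.
        if basename.startswith("completedCheckpoint"):
            continue
        # Only remove artifacts whose last modification is older than the cutoff.
        if blob.last_modified < cutoff:
            container.delete_blob(blob.name)

The idea is to skip anything that is still being touched (like the completedCheckpoint files) and only delete blobs whose last-modified timestamp is much older than the checkpointing interval.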
Regards,
Alexis.

On Wed, Dec 7, 2022 at 13:03, Matthias Pohl <matthias.p...@aiven.io> wrote:

> Flink should already take care of cleaning up the artifacts you mentioned.
> Flink 1.15+ even includes retries if something goes wrong. There are still
> a few bugs that need to be fixed (e.g. FLINK-27355 [1]), and checkpoint HA
> data is not properly cleaned up yet, which is covered by FLIP-270 [2].
>
> It would be interesting to know why these artifacts haven't been deleted,
> assuming the corresponding job is actually in a final state (e.g. FAILED,
> CANCELLED, FINISHED), i.e. there is a JobResultStoreEntry file for that
> specific job in the folder Gyula was referring to in the linked
> documentation. At least for the JobGraph files, it's likely that you still
> have additional metadata in your HA backend that refers to the files. That
> is something you might want to remove as well if you want a proper
> cleanup. But still, it would be good to understand why these files are not
> cleaned up by Flink.
>
> Best,
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-27355
> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
>
> On Tue, Dec 6, 2022 at 5:42 PM Alexis Sarda-Espinosa <sarda.espin...@gmail.com> wrote:
>
>> One concrete question: under the HA folder I also see these sample
>> entries:
>>
>> - job_name/blob/job_uuid/blob_...
>> - job_name/submittedJobGraphX
>> - job_name/submittedJobGraphY
>>
>> Is it safe to clean these up while the job is in a healthy state?
>>
>> Regards,
>> Alexis.
>>
>> On Mon, Dec 5, 2022 at 20:09, Alexis Sarda-Espinosa <sarda.espin...@gmail.com> wrote:
>>
>>> Hi Gyula,
>>>
>>> That certainly helps, but to set up automatic cleanup (in my case, of
>>> Azure Blob Storage), the ideal option would be a simple lifecycle
>>> policy that deletes blobs that haven't been updated in some time. That
>>> would assume that anything relevant for the latest state is "touched"
>>> by the JM on every checkpoint, and since I also see blobs referencing
>>> "submitted job graphs", I imagine that might not be a safe assumption.
>>>
>>> I understand the life cycle of those blobs isn't directly managed by
>>> the operator, but in that regard it could make things more cumbersome.
>>>
>>> Ideally, Flink itself would guarantee this sort of allowable TTL for
>>> HA files, but I'm sure that's not trivial.
>>>
>>> Regards,
>>> Alexis.
>>>
>>> On Mon, 5 Dec 2022, 19:19 Gyula Fóra, <gyula.f...@gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> There are some files in the HA dir that are not cleaned up over time
>>>> and need to be cleaned up by the user:
>>>>
>>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/#jobresultstore-resource-leak
>>>>
>>>> Hope this helps,
>>>> Gyula
>>>>
>>>> On Mon, 5 Dec 2022 at 11:56, Alexis Sarda-Espinosa <sarda.espin...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I see that the number of entries in the directory configured for HA
>>>>> increases over time, particularly in the context of job upgrades in
>>>>> a Kubernetes environment managed by the operator.
>>>>> Would it be safe to assume that any files which haven't been updated
>>>>> in a while can be deleted, assuming the checkpointing interval is
>>>>> much smaller than the period used to determine whether files are too
>>>>> old?
>>>>>
>>>>> Regards,
>>>>> Alexis.