[ https://issues.apache.org/jira/browse/FLINK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415322#comment-17415322 ]
Piotr Nowojski commented on FLINK-24149:
----------------------------------------

{quote}
Regarding first checkpoint after new job instance restored from a retained checkpoint can't be incremental, I think this can be optimized in some ways.
{quote}
We also want to address this issue, at least in some scenarios. One option we are considering is that a job newly started from a savepoint or retained checkpoint could claim full ownership of the files it started from, which would allow even that job's first checkpoint to be fully incremental.

{quote}
Hi Piotr Nowojski, is there any documentation for this WIP FLIP? I am very interested.
{quote}
It's not yet published. I hope it will be published on the dev mailing list in the coming weeks. If I don't forget, I will cross-reference it here once it is posted :)


> Make checkpoint self-contained and relocatable
> ----------------------------------------------
>
>                 Key: FLINK-24149
>                 URL: https://issues.apache.org/jira/browse/FLINK-24149
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Feifan Wang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2021-09-08-17-06-31-560.png, image-2021-09-08-17-10-28-240.png, image-2021-09-08-17-55-46-898.png, image-2021-09-08-18-01-03-176.png, image-2021-09-14-14-22-31-537.png
>
> h1. Background
> We have many jobs with large state in our production environment. Based on our experience operating these jobs and on the analysis of some specific problems, we believe that RocksDBStateBackend's incremental checkpoints have several advantages over savepoints:
> # Savepoints take much longer than incremental checkpoints for jobs with large state. The figure below shows a job in our production environment: a savepoint takes nearly 7 minutes to complete, while a checkpoint takes only a few seconds. (The fact that a checkpoint taken right after a savepoint takes longer is the problem described in FLINK-23949.)
> !image-2021-09-08-17-55-46-898.png|width=723,height=161!
> # Savepoints cause excessive CPU usage. The figure below shows the CPU usage of the same job as in the figure above:
> !image-2021-09-08-18-01-03-176.png|width=516,height=148!
> # Savepoints may cause excessive native memory usage and eventually push the TaskManager process memory beyond its limit. (We did not investigate the cause further or try to reproduce the problem on other large-state jobs; we only increased the overhead memory, so this point may not be conclusive.)
> For the above reasons, we would like to use retained incremental checkpoints to completely replace savepoints for jobs with large state.
> h1. Problems
> * *Problem 1: retained incremental checkpoints are difficult to clean up once they have been used for recovery*
> A job recovered from a retained incremental checkpoint may keep referencing files in that retained checkpoint's shared directory in its subsequent checkpoints, even though they belong to different job instances. In the worst case, retained checkpoints reference one another, forming a very long reference chain. This makes it difficult for users to manage retained checkpoints; in fact, we have already suffered failures caused by incorrect deletion of retained checkpoints.
> Although the file handles in the checkpoint metadata could be used to figure out which files are safe to delete (a sketch of that bookkeeping is appended at the end of this message), I think it is inappropriate to push that work onto users.
> * *Problem 2: checkpoints are not relocatable*
> Even if we can figure out all the files referenced by a checkpoint, moving those files would invalidate the checkpoint as well, because the metadata file references absolute file paths.
> Since savepoints are already self-contained and relocatable (FLINK-5763), why not just use savepoints to migrate jobs to another place? Besides the savepoint performance problems described in the background, a very important reason is that the need to migrate may come from the failure of the original cluster, in which case there is no opportunity to trigger a savepoint.
> h1. Proposal
> * *A job's checkpoint directory (user-defined-checkpoint-dir/<jobId>) contains all of its state files (self-contained)*
> As far as I know, currently only the subsequent checkpoints of jobs restored from a retained checkpoint violate this constraint. One possible solution is to re-upload all shared files at the first incremental checkpoint after the job starts, but we need to discuss how to distinguish a new job instance from a restart.
> * *Use relative file paths in the checkpoint metadata (relocatable)*
> Change all file references in the checkpoint metadata to paths relative to the _metadata file, so that user-defined-checkpoint-dir/<jobId> can be copied to any other place (a sketch of this resolution is appended at the end of this message).
>
> BTW, this issue is very similar to FLINK-5763, which can be read as background material.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
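
A minimal sketch of the cleanup bookkeeping mentioned under Problem 1, i.e. what users currently have to do by hand: given the set of files belonging to a retained checkpoint and the set of files still referenced by jobs restored from it (each job's set would have to be extracted from its latest _metadata, for example via Flink's checkpoint metadata deserializer), only the difference is safe to delete. All paths and the class name below are made up for illustration; this is not an existing Flink utility.

{code:java}
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class RetainedCheckpointCleanup {

    // Files belonging to the retained checkpoint that no live job still references.
    static Set<Path> deletableFiles(Set<Path> retainedCheckpointFiles,
                                    Set<Path> filesReferencedByLiveJobs) {
        Set<Path> deletable = new HashSet<>(retainedCheckpointFiles);
        deletable.removeAll(filesReferencedByLiveJobs);
        return deletable;
    }

    public static void main(String[] args) {
        // Hypothetical shared-directory contents of the retained checkpoint.
        Set<Path> retained = new HashSet<>();
        retained.add(Paths.get("/checkpoints/0123abcd/shared/sst-000001"));
        retained.add(Paths.get("/checkpoints/0123abcd/shared/sst-000002"));

        // A job restored from that checkpoint still references sst-000001
        // (this set would come from parsing that job's latest _metadata).
        Set<Path> referenced = new HashSet<>();
        referenced.add(Paths.get("/checkpoints/0123abcd/shared/sst-000001"));

        // Only sst-000002 is safe to delete.
        System.out.println("Safe to delete: " + deletableFiles(retained, referenced));
    }
}
{code}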
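
And a minimal sketch of the relocatable-metadata idea from the Proposal section: if references are stored relative to the _metadata file, copying user-defined-checkpoint-dir/<jobId> elsewhere keeps them resolvable. The real change would live in Flink's metadata (de)serialization; the directory layout, paths, and class name here are illustrative only.

{code:java}
import java.nio.file.Path;
import java.nio.file.Paths;

public class RelocatableMetadataSketch {

    // On write: store a state-file reference relative to the _metadata file's directory.
    static String relativize(Path metadataFile, Path stateFile) {
        return metadataFile.getParent().relativize(stateFile).toString();
    }

    // On read: resolve a stored relative reference against wherever _metadata now lives.
    static Path resolve(Path metadataFile, String relativeReference) {
        return metadataFile.getParent().resolve(relativeReference).normalize();
    }

    public static void main(String[] args) {
        // Illustrative layout: user-defined-checkpoint-dir/<jobId>/{chk-42/_metadata, shared/...}.
        Path metadata = Paths.get("/checkpoints/0123abcd/chk-42/_metadata");
        Path sstFile  = Paths.get("/checkpoints/0123abcd/shared/sst-000007");

        String stored = relativize(metadata, sstFile);
        System.out.println("stored in _metadata: " + stored); // ../shared/sst-000007

        // After copying /checkpoints/0123abcd to another cluster or bucket:
        Path movedMetadata = Paths.get("/backup/0123abcd/chk-42/_metadata");
        System.out.println("resolved after move: " + resolve(movedMetadata, stored));
    }
}
{code}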