[ https://issues.apache.org/jira/browse/FLINK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414958#comment-17414958 ]
Piotr Nowojski commented on FLINK-24149:
----------------------------------------

> BTW, FLIP-47 marked checkpoint as self-contained, is that a mistake?

I believe so. However, FLIP-47 will be superseded by a newer design that is WIP.

Regarding making paths relative: as I think [~yunta] mentioned, if a checkpoint references files from a different job, that's not possible. So some of the paths would have to remain absolute, and we would need logic for handling/deciding when paths should be absolute vs. relative. We don't have anything like this, and it would be another potential source of problems in the future.

Since I have a feeling that this WIP FLIP should address your issues in a more generic way, I would suggest holding off on the proposed change until that FLIP is published.

> Make checkpoint self-contained and relocatable
> ----------------------------------------------
>
>                 Key: FLINK-24149
>                 URL: https://issues.apache.org/jira/browse/FLINK-24149
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Feifan Wang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2021-09-08-17-06-31-560.png, image-2021-09-08-17-10-28-240.png, image-2021-09-08-17-55-46-898.png, image-2021-09-08-18-01-03-176.png, image-2021-09-14-14-22-31-537.png
>
>
> h1. Background
> We have many jobs with a large state size in our production environment. Based on operational experience with these jobs and the analysis of some specific problems, we believe that RocksDBStateBackend's incremental checkpoint has many advantages over savepoint:
> # A savepoint takes much longer than an incremental checkpoint for jobs with large state.
> The figure below shows a job in our production environment: it takes nearly 7 minutes to complete a savepoint, while a checkpoint takes only a few seconds. (A checkpoint taking longer right after a savepoint is the problem described in -FLINK-23949-.)
> !image-2021-09-08-17-55-46-898.png|width=723,height=161!
> # Savepoints cause excessive CPU usage. The figure below shows the CPU usage of the same job as in the figure above:
> !image-2021-09-08-18-01-03-176.png|width=516,height=148!
> # Savepoints may cause excessive native memory usage and eventually push the TaskManager process memory usage over the limit. (We did not investigate the cause further or try to reproduce the problem on other large-state jobs, but only increased the overhead memory, so this point may not be conclusive.)
> For the above reasons, we would like to use retained incremental checkpoints to completely replace savepoints for jobs with large state size.
> h1. Problems
> * *Problem 1: retained incremental checkpoints are difficult to clean up once they have been used for recovery*
> This problem arises because jobs recovered from a retained incremental checkpoint may reference files in that retained checkpoint's shared directory in their subsequent checkpoints, even though they are not the same job instance. In the worst case, retained checkpoints reference one another in sequence, forming a very long reference chain. This makes it difficult for users to manage retained checkpoints; in fact, we have suffered failures caused by incorrect deletion of retained checkpoints.
> Although we could use the file handles in the checkpoint metadata to figure out which files can be deleted, I think it is inappropriate to put that burden on users.
> * *Problem 2: checkpoints are not relocatable*
> Even if we could figure out all the files referenced by a checkpoint, moving those files would invalidate the checkpoint anyway, because the metadata file references absolute file paths.
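To make the reference-chain problem concrete, here is a minimal sketch. The classes and directory layout below are illustrative only, not Flink's actual checkpoint classes or metadata format; they just model how a job restored from another job's retained incremental checkpoint can keep referencing shared files outside its own checkpoint directory:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model (NOT Flink's real classes) of an incremental checkpoint that
// records the absolute paths of the shared state files it references.
class Checkpoint {
    final String jobId;
    final Set<String> sharedFiles;

    Checkpoint(String jobId, Set<String> sharedFiles) {
        this.jobId = jobId;
        this.sharedFiles = sharedFiles;
    }
}

public class ReferenceChainDemo {
    public static void main(String[] args) {
        // Job A takes an incremental checkpoint; its shared file lives under jobA's directory.
        Checkpoint cpA = new Checkpoint("jobA",
                new HashSet<>(Set.of("/ckpt/jobA/shared/000001.sst")));

        // Job B is restored from cpA. Its next incremental checkpoint reuses the
        // unchanged file from jobA's shared directory instead of re-uploading it.
        Set<String> filesB = new HashSet<>(cpA.sharedFiles);
        filesB.add("/ckpt/jobB/shared/000002.sst");
        Checkpoint cpB = new Checkpoint("jobB", filesB);

        // Deleting jobA's retained checkpoint directory would therefore corrupt cpB.
        boolean crossJobReference = cpB.sharedFiles.stream()
                .anyMatch(f -> !f.startsWith("/ckpt/jobB/"));
        System.out.println(crossJobReference);  // true
    }
}
```

If job C were later restored from cpB, the chain would grow by one more link, which is exactly why the retained checkpoints become hard to delete safely.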
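The relocatability problem comes down to absolute paths in the metadata. A minimal sketch of the alternative, using only `java.nio.file` (the paths and directory layout are hypothetical, not Flink's actual metadata format): store each state-file reference relative to the `_metadata` file's directory, then resolve it against wherever the checkpoint directory ends up after a move.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of "relocatable" references: write paths relative to the _metadata
// directory, so the whole checkpoint directory can be copied elsewhere.
// Directory layout is illustrative only.
public class RelativizeDemo {
    public static void main(String[] args) {
        Path metadataDir = Paths.get("/ckpt/jobA/chk-42");
        Path stateFile   = Paths.get("/ckpt/jobA/shared/000001.sst");

        // What would be written into the metadata: a relative reference.
        Path relative = metadataDir.relativize(stateFile);
        System.out.println(relative);  // ../shared/000001.sst (on Unix)

        // On restore, resolve against wherever the checkpoint now lives.
        Path movedDir = Paths.get("/backup/jobA-copy/chk-42");
        Path resolved = movedDir.resolve(relative).normalize();
        System.out.println(resolved);  // /backup/jobA-copy/shared/000001.sst
    }
}
```

Note that this only works for files under a common ancestor of the checkpoint directory; a reference into a different job's directory (the cross-job case discussed in the comment above) would climb outside it, which is why some paths would have to stay absolute.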
> Since savepooints are already self-contained and relocatable (FLINK-5763), why don't we just use savepoints to migrate jobs to another place? Besides the savepoint performance problems described in the background, a very important reason is that the migration requirement may come from a failure of the original cluster, in which case there is no opportunity to trigger a savepoint.
> h1. Proposal
> * *A job's checkpoint directory (user-defined-checkpoint-dir/<jobId>) contains all of its state files (self-contained)*
> As far as I know, at present only the subsequent checkpoints of jobs restored from a retained checkpoint violate this constraint. One possible solution is to re-upload all shared files at the first incremental checkpoint after the job starts, but we need to discuss how to distinguish a new job instance from a restart.
> * *Use relative file paths in checkpoint metadata (relocatable)*
> Change all file references in the checkpoint metadata to paths relative to the _metadata file, so that user-defined-checkpoint-dir/<jobId> can be copied to any other place.
>
> BTW, this issue is very similar to FLINK-5763, which can be read as background material.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)