Hi Han,

Thanks for the proposal. Faster Checkpoint & Recovery lays the groundwork for Disaggregated State to adapt to cloud-native deployment. Regarding the FLIP, I have three comments:
1. Are there any preliminary evaluation results available for this feature?
2. In terms of compatibility, can this feature be enabled using an existing checkpoint or a native savepoint?
3. Does this feature introduce any additional overhead?

On Fri, Feb 21, 2025 at 19:00, Han Yin <alexyin...@gmail.com> wrote:

> Hi Zakelly,
>
> Thanks for your response!
>
> 1. Sure. I’ve added a section called ‘End-to-end user case’ after the section
> ‘Overview’.
> 2. Yes, because reusing files somewhat goes against the semantics of a full
> checkpoint. If a full checkpoint is enforced, the FileTransferStrategy will
> force the files to be transferred by copying instead of by reusing.
> 3. Yes. The changes all happen under the ForStStateBackend. I’ve updated the
> section in the FLIP.
> 4. In fact, we don't need much special file handling for checkpoint failures,
> as they are managed by ForSt’s snapshot strategy. The proposed
> FileTransferStrategy only checks whether the files are successfully
> transferred. If the transfer is unsuccessful, it throws an exception,
> ultimately failing the checkpoint. If the transfer succeeds but the
> checkpoint is aborted, since the file is already 'uploaded' to the checkpoint
> directory, it is no longer owned by the DB, and the snapshot strategy will
> skip re-uploading it for subsequent checkpoints.
>
> > On Feb 17, 2025, at 11:44, Zakelly Lan <zakelly....@gmail.com> wrote:
> >
> > Hi Han,
> >
> > Thanks for driving this!
> >
> > The FLIP is in good shape; here are my comments:
> >
> > 1. The FLIP introduces file reuse during snapshot and recovery. Could
> > you please provide some common use cases from the user's perspective, e.g.
> > periodic checkpoint, native savepoint?
> > 2. Does the current design depend on incremental checkpoints? If we
> > enforce a full checkpoint, what happens?
> > 3. Will all the proposed changes be under the ForStStateBackend? It would be
> > better to emphasize this in 'Proposed Changes'.
> > 4. Is there any special file handling for checkpoint failure?
> >
> >
> > Best,
> > Zakelly
> >
> >
> > On Fri, Feb 14, 2025 at 6:35 PM Han Yin <alexyin...@gmail.com> wrote:
> >
> >> Hi everyone,
> >>
> >> I would like to open a discussion on implementing faster checkpoint &
> >> recovery for disaggregated state[1].
> >>
> >> This is an improvement to the disaggregated state management store ForSt,
> >> so you may want to read FLIP-423[2] and FLIP-428[3] for background.
> >>
> >> Currently, ForSt copies or fast-duplicates files between the working
> >> directory and the checkpoint directory during checkpointing and
> >> restoration. However, in a disaggregated environment, there is no need to
> >> maintain multiple copies of files since they typically reside within the
> >> same remote file system. Therefore, we propose an approach for reusing
> >> files when ForSt generates snapshots or restores from checkpoints and for
> >> managing the file ownership between Flink & ForSt. By eliminating the
> >> overhead of file copying, checkpointing, restoration, and rescaling can
> >> become significantly faster for disaggregated state.
> >>
> >> Looking forward to your comments or feedback.
> >>
> >> Best regards,
> >> Han Yin
> >>
> >> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046898
> >> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855
> >> [3] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046865

--
Best,
Yanfei