Re: [DISCUSS] FLIP-432: Faster Checkpoint & Recovery for Disaggregated State

Yanfei Lei Wed, 19 Mar 2025 01:36:50 -0700

Hi Han,

Thanks for the proposal.
Faster Checkpoint & Recovery lays the groundwork for Disaggregated
State to adapt to cloud-native deployment. Regarding the FLIP, I have
three comments:


1. Are there any preliminary evaluation results available for this feature?
2. In terms of compatibility, can this feature be enabled using an
existing original checkpoint or a native savepoint?
3. Does this feature introduce any additional overhead?

On Fri, Feb 21, 2025 at 19:00, Han Yin <[email protected]> wrote:

>
> Hi Zakelly,
> Thanks for your response!
> 1. Sure. I’ve added a section called ‘End-to-end user case’ after the
> ‘Overview’ section.
> 2. Yes, because reusing files somewhat goes against the semantics of a full
> checkpoint. If a full checkpoint is enforced, the FileTransferStrategy will
> transfer the files by copying them instead of by reusing them.
> 3. Yes. All the changes happen under the ForStStateBackend. I’ve updated the
> section in the FLIP accordingly.
> 4. In fact, we don't need much special file handling for checkpoint failures,
> as they are managed by ForSt’s snapshot strategy. The proposed
> FileTransferStrategy only checks whether the files are successfully
> transferred. If the transfer is unsuccessful, it throws an exception,
> ultimately failing the checkpoint. If the transfer succeeds but the
> checkpoint is aborted, the file has already been 'uploaded' to the checkpoint
> directory and is no longer owned by the DB, so the snapshot strategy will
> skip re-uploading it for subsequent checkpoints.
>
> > On Feb 17, 2025, at 11:44, Zakelly Lan <[email protected]> wrote:
> >
> > Hi Han,
> >
> > Thanks for driving this!
> >
> > The FLIP is in good shape, here are my comments:
> >
> > 1. The FLIP introduces file reuse during snapshot and recovery. Could
> > you please provide some common use cases from the user's perspective, e.g.
> > periodic checkpoints or native savepoints?
> > 2. Does the current design depend on incremental checkpoints? If we
> > enforce a full checkpoint, what happens?
> > 3. Will all the proposed changes be under the ForStStateBackend? It would
> > be better to emphasize this in 'Proposed Changes'.
> > 4. Is there any special file handling for checkpoint failures?
> >
> >
> > Best,
> > Zakelly
> >
> >
> > On Fri, Feb 14, 2025 at 6:35 PM Han Yin <[email protected]> wrote:
> >
> >> Hi everyone,
> >>
> >> I would like to open a discussion on implementing faster checkpoint &
> >> recovery for disaggregated state[1].
> >>
> >> This work improves ForSt's disaggregated state management, so you may
> >> want to read FLIP-423[2] and FLIP-428[3] for background.
> >>
> >> Currently, ForSt copies or fast-duplicates files between the working
> >> directory and the checkpoint directory during checkpointing and
> >> restoration. However, in a disaggregated environment, there is no need to
> >> maintain multiple copies of files since they typically reside within the
> >> same remote file system. Therefore, we propose an approach for reusing
> >> files when ForSt generates snapshots or restores from checkpoints and for
> >> managing the file ownership between Flink & ForSt. By eliminating the
> >> overhead of file copying, checkpointing & restoration & rescaling can
> >> become significantly faster for disaggregated state.
> >>
> >> Looking forward to your comments or feedback.
> >>
> >> Best regards,
> >> Han Yin
> >>
> >> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046898
> >> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855
> >> [3] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046865
> >>
> >>
> >>
> >>
>
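
For readers following the thread, the behaviour described in points 2 and 4 of
Han's reply quoted above can be sketched roughly as below. This is only an
illustration under stated assumptions, not the FLIP's actual interface: the
class name FileTransferSketch, the Mode enum, and the methods chooseMode and
transfer are all hypothetical, and local java.nio paths stand in for the
remote file system. The sketch shows that enforcing a full checkpoint forces
copying, and that a failed transfer surfaces as an exception, which the
caller would turn into a failed checkpoint.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch; none of these names come from the FLIP or from Flink.
public class FileTransferSketch {

    /** How a state file gets into the checkpoint: shared in place, or duplicated. */
    enum Mode { REUSE, COPY }

    /**
     * Assumption: reuse is only possible when the working directory and the
     * checkpoint directory live on the same file system, and an enforced full
     * checkpoint always falls back to copying.
     */
    static Mode chooseMode(boolean fullCheckpointEnforced, boolean sameFileSystem) {
        return (fullCheckpointEnforced || !sameFileSystem) ? Mode.COPY : Mode.REUSE;
    }

    /**
     * Transfers one file and returns the path the checkpoint should reference.
     * Any failure is reported as an IOException.
     */
    static Path transfer(Path source, Path checkpointDir, Mode mode) throws IOException {
        if (!Files.isRegularFile(source)) {
            throw new IOException("State file is missing: " + source);
        }
        if (mode == Mode.REUSE) {
            // No bytes are moved; the checkpoint references the existing file.
            return source;
        }
        Files.createDirectories(checkpointDir);
        Path target = checkpointDir.resolve(source.getFileName());
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path workingDir = Files.createTempDirectory("working");
        Path checkpointDir = Files.createTempDirectory("checkpoint");
        Path stateFile = Files.createTempFile(workingDir, "state", ".sst");

        Mode incremental = chooseMode(false, true);
        Mode full = chooseMode(true, true);
        System.out.println(incremental + " -> " + transfer(stateFile, checkpointDir, incremental));
        System.out.println(full + " -> " + transfer(stateFile, checkpointDir, full));
    }
}
```

The ownership hand-over mentioned in point 4 (a transferred file no longer
being owned by the DB) is deliberately left out, since the thread does not
spell out how it is tracked.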


--
Best,
Yanfei
