Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State

Yun Tang Wed, 27 Mar 2024 08:49:03 -0700

Hi Jinzhong,

The overall design looks good.


I have two minor questions:

1. Why must we have another 'subTask-checkpoint-sub-dir' under the shared
directory? If we don't consider TM ownership in this FLIP, this design
seems unnecessary.
2. This FLIP does not mention cleaning up the remote working directory when
a TaskManager crashes. Even though this is an open problem, we can
still leave some room for future optimization.

Best,
Yun Tang

________________________________
From: Jinzhong Li <lijinzhong2...@gmail.com>
Sent: Monday, March 25, 2024 10:41
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for 
Disaggregated State

Hi Yue,

Thanks for your comments.

CURRENT is a special file that points to the latest manifest log
file. As Zakelly explained above, we can record the latest manifest
filename during the sync phase and write that filename into the CURRENT
snapshot file during the async phase.
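
A minimal sketch of this two-phase handling, assuming CURRENT holds just the
manifest filename followed by a newline (the RocksDB convention). The class and
method names are hypothetical and are not ForSt or Flink APIs, and local paths
stand in for the remote storage:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not part of ForSt or Flink.
final class CurrentFileSnapshot {

    // What the sync phase keeps in memory: the manifest name CURRENT pointed to.
    static final class SyncResult {
        final String manifestName;

        SyncResult(String manifestName) {
            this.manifestName = manifestName;
        }
    }

    // Sync phase: read CURRENT once and remember the manifest filename.
    static SyncResult syncPhase(Path dbDir) throws IOException {
        byte[] raw = Files.readAllBytes(dbDir.resolve("CURRENT"));
        return new SyncResult(new String(raw, StandardCharsets.UTF_8).trim());
    }

    // Async phase: write the remembered name into the snapshot's CURRENT file,
    // without re-reading the live CURRENT, which may have moved on by now.
    static Path asyncPhase(SyncResult result, Path checkpointDir) throws IOException {
        Files.createDirectories(checkpointDir);
        Path out = checkpointDir.resolve("CURRENT");
        Files.write(out, (result.manifestName + "\n").getBytes(StandardCharsets.UTF_8));
        return out;
    }
}
```

In the scenario Yue describes, if the manifest rolls over from 8 to 9 between
the two phases, the CURRENT written by asyncPhase still names manifest 8,
because it is built from the in-memory value rather than from the live file.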

Best,
Jinzhong

On Fri, Mar 22, 2024 at 11:16 PM Zakelly Lan <zakelly....@gmail.com> wrote:

> Hi Yue,
>
> Thanks for bringing this up!
>
> The CURRENT file is a special one, which should be snapshotted during the
> sync phase (temporarily loaded into memory). Thus we can solve this.
>
>
> Best,
> Zakelly
>
> On Fri, Mar 22, 2024 at 4:55 PM yue ma <mayuefi...@gmail.com> wrote:
>
> > Hi Jinzhong,
> > Thanks for your reply. I still have some doubts about the first question.
> > Is there a case like the following? When you take a snapshot during the
> > synchronization phase, you record CURRENT and manifest 8, but before the
> > asynchronous phase the manifest reaches the size threshold and CURRENT
> > now points to the new manifest 9, so the incorrect CURRENT file gets
> > uploaded?
> >
> > On Wed, Mar 20, 2024 at 8:13 PM Jinzhong Li <lijinzhong2...@gmail.com> wrote:
> >
> > > Hi Yue,
> > >
> > > Thanks for your feedback!
> > >
> > > > 1. If we choose Option-3 for ForSt, how would we handle the Manifest
> > > > file? Should we take a snapshot of the Manifest during the
> > > > synchronization phase?
> > >
> > > IIUC, the GetLiveFiles() API in Option-3 also captures the file info of
> > > Manifest files, and it also returns the manifest file size, which
> > > means this API can snapshot the Manifest file info (filename +
> > > file size) during the synchronization phase.
> > > You can refer to the RocksDB source code [1] to verify this.
> > >
> > >
> > > > However, many distributed storage systems do not support
> > > > Fast Duplicate (such as HDFS). But ForSt has the ability to
> > > > directly read and write remote files. Can we skip copying or Fast
> > > > Duplicating these files and instead directly reuse and reference
> > > > these remote files? I think this can reduce file download time and
> > > > may be more useful for most users who use HDFS (which does not
> > > > support Fast Duplicate).
> > >
> > > Firstly, as far as I know, most remote file systems support
> > > FastDuplicate, e.g. S3 copyObject, Azure Blob Storage copyBlob, and OSS
> > > copyObject, while HDFS indeed does not support FastDuplicate.
> > >
> > > Actually, we have considered a design that reuses remote files, and
> > > that is what we want to implement in the near future, where both
> > > checkpoints and restores can reuse existing files residing on the remote
> > > state storage. However, this design conflicts with the current file
> > > management system in Flink. At present, remote state files are managed by
> > > ForStDB (TaskManager side), while checkpoint files are managed by the
> > > JobManager, which is a major hindrance to file reuse. For example, issues
> > > could arise if a TM reuses a checkpoint file that is subsequently deleted
> > > by the JM. Therefore, as mentioned in FLIP-423 [2], our roadmap is to
> > > first integrate the checkpoint/restore mechanisms with the existing
> > > framework at milestone-1. Then, at milestone-2, we plan to introduce TM
> > > State Ownership and Faster Checkpointing mechanisms, which will allow
> > > both checkpointing and restoring to directly reuse remote files, thus
> > > achieving faster checkpointing and restoring.
> > >
> > > [1] https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/db/db_filesnapshot.cc#L77
> > > [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-RoadMap+LaunchingPlan
> > >
> > > Best,
> > > Jinzhong
> > >
> > > On Wed, Mar 20, 2024 at 4:01 PM yue ma <mayuefi...@gmail.com> wrote:
> > >
> > > > Hi Jinzhong,
> > > >
> > > > Thank you for initiating this FLIP.
> > > >
> > > > I have just a few minor questions:
> > > >
> > > > 1. If we choose Option-3 for ForSt, how would we handle the Manifest
> > > > file? Should we take a snapshot of the Manifest during the
> > > > synchronization phase? Otherwise, might the Manifest and MetaInfo
> > > > information be inconsistent during recovery?
> > > > 2. For the Restore Operation, we need to Fast Duplicate checkpoint
> > > > files to the working dir. However, many distributed storage systems
> > > > do not support Fast Duplicate (such as HDFS). But ForSt has the
> > > > ability to directly read and write remote files. Can we skip copying
> > > > or Fast Duplicating these files and instead directly reuse and
> > > > reference these remote files? I think this can reduce file download
> > > > time and may be more useful for most users who use HDFS (which does
> > > > not support Fast Duplicate).
> > > >
> > > > --
> > > > Best,
> > > > Yue
> > > >
> > >
> >
> >
> > --
> > Best,
> > Yue
> >
>
