Hi Jinzhong,

Thanks for the FLIP. +1 for it.

I have a few questions:

- Why don't we consider an option where the checkpoint directory contains
only metadata? Then we would not need to copy the data from the working
directory to the checkpoint directory on every checkpoint.
Basically, when checkpointing, we would 1) mark the files in the working
directory as "read-only", 2) optionally change the working directory, and
3) update the metadata in the checkpoint directory (so that it points to
the sealed files in the working directory), as in the sketch below.
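To make the idea concrete, here is a minimal Java sketch (the directory
layout and the metadata format are hypothetical, just to illustrate the
idea; this is not an existing Flink API):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    class MetadataOnlyCheckpoint {
        static void checkpoint(Path workingDir, Path checkpointDir)
                throws Exception {
            // 1) Seal the current working files so nothing the
            //    checkpoint references can be rewritten later.
            List<Path> sealed;
            try (Stream<Path> files = Files.list(workingDir)) {
                sealed = files.collect(Collectors.toList());
            }
            for (Path f : sealed) {
                f.toFile().setReadOnly();
            }
            // 2) (Optionally) new writes go to a fresh working
            //    directory from this point on.
            // 3) Persist only pointers (path + size) in the checkpoint
            //    directory; no file data is copied.
            List<String> pointers = sealed.stream()
                    .map(f -> f.toAbsolutePath() + "," + f.toFile().length())
                    .collect(Collectors.toList());
            Files.write(checkpointDir.resolve("_metadata"), pointers);
        }
    }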

- This is more of a technical question about RocksDB. Do we have a
guarantee that when calling DB.GetLiveFiles(), the WAL will be emptied as
well?
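For reference, this is the call I have in mind, via the RocksJava
bindings and assuming an already-open RocksDB handle "db" (ForSt's
wrapper may of course differ):

    import org.rocksdb.LiveFiles;

    // flushMemtable=true flushes the memtables before the file list is
    // built; what I am unsure about is whether the WAL is also
    // guaranteed to be emptied at this point.
    LiveFiles live = db.getLiveFiles(true);
    for (String f : live.files) {
        System.out.println(f); // e.g. "/000123.sst", "/MANIFEST-000004"
    }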

- As far as I understand, DB.GetLiveFiles() acquires the global mutex
lock. I am wondering whether RocksDB's optimistic transactions could be
of any help in this situation?
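Something along these lines is what I have in mind (RocksJava, with a
hypothetical path):

    import org.rocksdb.OptimisticTransactionDB;
    import org.rocksdb.Options;
    import org.rocksdb.Transaction;
    import org.rocksdb.WriteOptions;

    try (Options options = new Options().setCreateIfMissing(true);
         OptimisticTransactionDB txnDb =
                 OptimisticTransactionDB.open(options, "/tmp/forst-db");
         WriteOptions writeOptions = new WriteOptions();
         Transaction txn = txnDb.beginTransaction(writeOptions)) {
        // Conflicts are validated at commit time rather than excluded
        // up front by locking.
        txn.put("key".getBytes(), "value".getBytes());
        txn.commit();
    }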

Regards,
Jeyhun

On Wed, Mar 20, 2024 at 1:35 PM Jinzhong Li <lijinzhong2...@gmail.com>
wrote:

> Hi Yue,
>
> Thanks for your feedback!
>
> > 1. If we choose Option-3 for ForSt, how would we handle the Manifest
> > file? Should we take a snapshot of the Manifest during the
> > synchronization phase?
>
> IIUC, the GetLiveFiles() API in Option-3 also captures the fileInfo of
> the Manifest files, and it returns the manifest file size as well,
> which means this API can take a snapshot of the Manifest FileInfo
> (filename + fileSize) during the synchronization phase.
> You can refer to the RocksDB source code[1] to verify this.
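> For instance, with the RocksJava bindings (a sketch; ForSt's interface
> may differ), given an open db handle:
>
>     import org.rocksdb.LiveFiles;
>
>     LiveFiles live = db.getLiveFiles(true);
>     // live.files lists the MANIFEST entry alongside the SSTs, and
>     // live.manifestFileSize reports its length at this point in time,
>     // so (filename + fileSize) can be captured in the sync phase.
>     long manifestSize = live.manifestFileSize;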
>
>
> > However, many distributed storage systems do not support the ability
> > of Fast Duplicate (such as HDFS). But ForSt has the ability to
> > directly read and write remote files. Can we avoid copying or Fast
> > duplicating these files, and instead directly reuse and reference
> > these remote files? I think this could reduce file download time and
> > may be more useful for most users who use HDFS (which does not
> > support Fast Duplicate)?
>
> Firstly, as far as I know, most remote file systems do support
> FastDuplicate, e.g. S3 copyObject, Azure Blob Storage copyBlob, and OSS
> copyObject; HDFS indeed does not support FastDuplicate.
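> For example, a server-side copy with the AWS SDK for Java v2 (bucket
> and key names are made up for illustration):
>
>     import software.amazon.awssdk.services.s3.S3Client;
>     import software.amazon.awssdk.services.s3.model.CopyObjectRequest;
>
>     try (S3Client s3 = S3Client.create()) {
>         // The copy happens entirely inside S3; no object data flows
>         // through the client, which is what makes it "fast".
>         s3.copyObject(CopyObjectRequest.builder()
>                 .sourceBucket("flink-checkpoints")
>                 .sourceKey("chk-42/000123.sst")
>                 .destinationBucket("flink-working-dir")
>                 .destinationKey("db/000123.sst")
>                 .build());
>     }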
>
> Actually, we have considered a design that reuses remote files, and
> that is what we want to implement in the near future, where both
> checkpoints and restores can reuse existing files residing on the
> remote state storage. However, this design conflicts with the current
> file management system in Flink. At present, remote state files are
> managed by ForStDB (on the TaskManager side), while checkpoint files
> are managed by the JobManager, which is a major hindrance to file
> reuse. For example, issues could arise if a TM reuses a checkpoint file
> that is subsequently deleted by the JM. Therefore, as mentioned in
> FLIP-423[2], our roadmap is to first integrate the checkpoint/restore
> mechanisms with the existing framework at milestone-1. Then, at
> milestone-2, we plan to introduce TM State Ownership and Faster
> Checkpointing mechanisms, which will allow both checkpointing and
> restoring to directly reuse remote files, thus achieving faster
> checkpoints and restores.
>
> [1]
>
> https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/db/db_filesnapshot.cc#L77
> [2]
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-RoadMap+LaunchingPlan
>
> Best,
> Jinzhong
>
> On Wed, Mar 20, 2024 at 4:01 PM yue ma <mayuefi...@gmail.com> wrote:
>
> > Hi Jinzhong
> >
> > Thank you for initiating this FLIP.
> >
> > I have just a few minor questions:
> >
> > 1. If we choose Option-3 for ForSt, how would we handle the Manifest
> > file? Should we take a snapshot of the Manifest during the
> > synchronization phase? Otherwise, might the Manifest and MetaInfo
> > information be inconsistent during recovery?
> > 2. For the Restore operation, we need to Fast Duplicate checkpoint
> > files to the working dir. However, many distributed storage systems
> > do not support the ability of Fast Duplicate (such as HDFS). But
> > ForSt has the ability to directly read and write remote files. Can we
> > avoid copying or Fast duplicating these files, and instead directly
> > reuse and reference these remote files? I think this could reduce
> > file download time and may be more useful for most users who use HDFS
> > (which does not support Fast Duplicate)?
> >
> > --
> > Best,
> > Yue
> >
>
