Hi Jeyhun,

Thanks for your thoughtful feedback!

> Why dont we consider an option where checkpoint directory just contains
> metadata. So, we do not need to copy the data all the time from working
> directory to the checkpointing directory.
> Basically, when checkpointing, 1) we mark files in working directories as
> "read-only", 2) optionally change the working directory, and 3) update the
> metadata in the checkpoint directory (that points to some files in the
> working directory).

The method you described is essentially consistent with the "Mid/long term
follow up work"[1] outlined in this FLIP.
Under this approach, ForStDB (on the TM side) needs to manage the lifecycle of
both state files and checkpoint files, so that checkpoint files can reuse the
state DB files. For instance, files that are still referenced by a checkpoint
but no longer needed by ForStDB must be guaranteed not to be deleted by
ForStDB.
However, this approach conflicts with the current mechanism in which the
JobManager manages the lifecycle of checkpoint files, so we need to refine it
step by step (see the rough sketch after the links below). As outlined in
FLIP-423[2], we will introduce this "zero-copy" faster checkpointing &
restoring in milestone-2.

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046865#FLIP428:FaultTolerance/RescaleIntegrationforDisaggregatedState-Mid/longtermfollowupwork
[2]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-RoadMap+LaunchingPlan
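
For illustration only, here is a rough Java sketch of the kind of reference
tracking the TM side would need before ForStDB could safely delete a state
file that a retained checkpoint may still point to (the class and method
names are made up for this email and are not part of the FLIP):

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class StateFileReferenceTracker {

    // state file path -> ids of checkpoints that still reference it
    private final Map<String, Set<Long>> references = new ConcurrentHashMap<>();

    void registerForCheckpoint(String filePath, long checkpointId) {
        references.computeIfAbsent(filePath, k -> ConcurrentHashMap.newKeySet())
                .add(checkpointId);
    }

    void onCheckpointSubsumedOrAborted(long checkpointId) {
        references.values().forEach(ids -> ids.remove(checkpointId));
    }

    // ForStDB would have to consult this before physically deleting a file
    // it no longer needs for serving state reads.
    boolean safeToDelete(String filePath) {
        Set<Long> ids = references.get(filePath);
        return ids == null || ids.isEmpty();
    }
}

As long as the JobManager owns the checkpoint files, this knowledge lives on
the JM side rather than in ForStDB, which is exactly the conflict described
above and why we defer this to milestone-2.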

> This is more of a technical question about RocksDB. Do we have a
> guarantee that when calling DB.GetLiveFiles(), the WAL will be emptied as
> well?

I think we don't need to consider the WAL; we can even disable it. The
reasons are:
(1) we will set flush_memtable=true when DB.GetLiveFiles() is invoked,
ensuring that MemTable data is also persisted into SST files (a minimal
sketch is shown below);
(2) DB.GetLiveFiles() is called during the synchronous phase of the snapshot,
during which there are no concurrent state read/write operations.
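
For reference, a minimal sketch of (1), assuming RocksJava's
getLiveFiles(boolean flushMemtable) API (this is illustrative, not the actual
ForSt snapshot code):

import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class SyncPhaseFileListing {

    // Capture the live file list during the synchronous snapshot phase.
    static RocksDB.LiveFiles captureLiveFiles(RocksDB db) throws RocksDBException {
        // flushMemtable = true: MemTable contents are flushed to SST files
        // first, so nothing recoverable is left only in the WAL.
        RocksDB.LiveFiles liveFiles = db.getLiveFiles(true);
        // liveFiles.files lists the SST/MANIFEST/CURRENT file names, and
        // liveFiles.manifestFileSize gives the manifest size at this point,
        // which is what the snapshot metadata records.
        return liveFiles;
    }
}

With the MemTable flushed this way, disabling the WAL (e.g. via
WriteOptions#setDisableWAL) would not affect the correctness of the snapshot.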

> As far as I understand, DB.GetLiveFiles() retrieves the global mutex
> lock. I am wondering if RocksDBs optimistic transactions can be any of
help
> in this situation?

As mentioned above, there is no concurrent read/write during
GetLiveFiles(), and since GetLiveFiles() is a purely in-memory operation, I'm
not particularly worried about the mutex impacting performance.

Best,
Jinzhong

On Fri, Mar 22, 2024 at 6:36 AM Jeyhun Karimov <je.kari...@gmail.com> wrote:

> Hi Jinzhong,
>
> Thanks for the FLIP. +1 for it.
>
> I have a few questions:
>
> - Why dont we consider an option where checkpoint directory just contains
> metadata. So, we do not need to copy the data all the time from working
> directory to the checkpointing directory.
> Basically, when checkpointing, 1) we mark files in working directories as
> "read-only", 2) optionally change the working directory, and 3) update the
> metadata in the checkpoint directory (that points to some files in the
> working directory).
>
> - This is more of a technical question about RocksDB. Do we have a
> guarantee that when calling DB.GetLiveFiles(), the WAL will be emptied as
> well?
>
> - As far as I understand, DB.GetLiveFiles() retrieves the global mutex
> lock. I am wondering if RocksDBs optimistic transactions can be any of help
> in this situation?
>
> Regards,
> Jeyhun
>
> On Wed, Mar 20, 2024 at 1:35 PM Jinzhong Li <lijinzhong2...@gmail.com>
> wrote:
>
> > Hi Yue,
> >
> > Thanks for your feedback!
> >
> > > 1. If we choose Option-3 for ForSt , how would we handle Manifest File
> > > ? Should we take a snapshot of the Manifest during the synchronization
> > phase?
> >
> > IIUC, the GetLiveFiles() API in Option-3 can also catch the fileInfo of
> > Manifest files, and this api also return the manifest file size, which
> > means this api could take snapshot for Manifest FileInfo (filename +
> > fileSize) during the synchronization phase.
> > You could refer to the rocksdb source code[1] to verify this.
> >
> >
> >  > However, many distributed storage systems do not support the
> > > ability of Fast Duplicate (such as HDFS). But ForSt has the ability to
> > > directly read and write remote files. Can we not copy or Fast duplicate
> > > these files, but instand of directly reuse and. reference these remote
> > > files? I think this can reduce file download time and may be more
> useful
> > > for most users who use HDFS (do not support Fast Duplicate)?
> >
> > Firstly, as far as I know, most remote file systems support the
> > FastDuplicate, eg. S3 copyObject/Azure Blob Storage copyBlob/OSS
> > copyObject, and the HDFS indeed does not support FastDuplicate.
> >
> > Actually,we have considered the design which reuses remote files. And
> that
> > is what we want to implement in the coming future, where both checkpoints
> > and restores can reuse existing files residing on the remote state
> storage.
> > However, this design conflicts with the current file management system in
> > Flink.  At present, remote state files are managed by the ForStDB
> > (TaskManager side), while checkpoint files are managed by the JobManager,
> > which is a major hindrance to file reuse. For example, issues could arise
> > if a TM reuses a checkpoint file that is subsequently deleted by the JM.
> > Therefore, as mentioned in FLIP-423[2], our roadmap is to first integrate
> > checkpoint/restore mechanisms with existing framework  at milestone-1.
> > Then, at milestone-2, we plan to introduce TM State Ownership and Faster
> > Checkpointing mechanisms, which will allow both checkpointing and
> restoring
> > to directly reuse remote files, thus achieving faster checkpointing and
> > restoring.
> >
> > [1]
> >
> >
> https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/db/db_filesnapshot.cc#L77
> > [2]
> >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-RoadMap+LaunchingPlan
> >
> > Best,
> > Jinzhong
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Mar 20, 2024 at 4:01 PM yue ma <mayuefi...@gmail.com> wrote:
> >
> > > Hi Jinzhong
> > >
> > > Thank you for initiating this FLIP.
> > >
> > > I have just some minor question:
> > >
> > > 1. If we choice Option-3 for ForSt , how would we handle Manifest File
> > > ? Should we take snapshot of the Manifest during the synchronization
> > phase?
> > > Otherwise, may the Manifest and MetaInfo information be inconsistent
> > during
> > > recovery?
> > > 2. For the Restore Operation , we need Fast Duplicate  Checkpoint Files
> > to
> > > Working Dir . However, many distributed storage systems do not support
> > the
> > > ability of Fast Duplicate (such as HDFS). But ForSt has the ability to
> > > directly read and write remote files. Can we not copy or Fast duplicate
> > > these files, but instand of directly reuse and. reference these remote
> > > files? I think this can reduce file download time and may be more
> useful
> > > for most users who use HDFS (do not support Fast Duplicate)?
> > >
> > > --
> > > Best,
> > > Yue
> > >
> >
>
