Hi Yun,

Thanks for your advice. I have added the topic of remote working files
cleanup to the current FLIP.
Best,
Jinzhong

On Sat, Mar 30, 2024 at 10:44 AM Yun Tang <myas...@live.com> wrote:

> Hi Jinzhong,
>
> Yes, I know the cleanup mechanism for the remote working directory is the
> same as the current RocksDB state backend's. However, the impact of
> residual files in the remote working directory is different from that of
> residual files in the local directory, especially since Flink only makes
> a best effort to clean up during StateBackend#dispose.
>
> I agree that we could leave the optimization to a future FLIP; however, I
> think we should mention this topic in the current FLIP to make the
> overall design more complete and polished.
>
> Best
> Yun Tang
> ________________________________
> From: Jinzhong Li <lijinzhong2...@gmail.com>
> Sent: Thursday, March 28, 2024 12:45
> To: dev@flink.apache.org <dev@flink.apache.org>
> Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for
> Disaggregated State
>
> Hi Feifan,
>
> Sorry for the misunderstanding. As Hangxiang explained, the basic cleanup
> mechanism for the remote working directory is the same as the RocksDB
> state backend's: when the TM exits, the ForSt state backend deletes the
> entire working dir. Regarding orphaned-file cleanup in the case of a TM
> crash, we will address it in a future FLIP.
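>
> A minimal sketch of what that best-effort cleanup amounts to, using
> Flink's FileSystem API (the disposeWorkingDir method and workingDir
> argument are illustrative names, not the actual ForSt code):
>
>     import java.io.IOException;
>     import org.apache.flink.core.fs.FileSystem;
>     import org.apache.flink.core.fs.Path;
>
>     /** Best-effort removal of the TM's remote working dir on disposal. */
>     static void disposeWorkingDir(Path workingDir) {
>         try {
>             FileSystem fs = workingDir.getFileSystem();
>             fs.delete(workingDir, true); // recursive delete of the dir
>         } catch (IOException e) {
>             // Best effort only: if this fails, or the TM crashes before
>             // disposal runs, the files stay orphaned until a future FLIP
>             // adds a reclamation mechanism.
>         }
>     }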
>
> Best,
> Jinzhong
>
> On Thu, Mar 28, 2024 at 12:35 PM Hangxiang Yu <master...@gmail.com> wrote:
>
> > Hi, Yun and Feifan.
> >
> > Thanks for your reply.
> >
> > About the cleanup of the working dir: as mentioned in FLIP-427, "The
> > life cycle of working dir is managed as before local strategy."
> > Since the current working dir and checkpoint dir are separate, the life
> > cycle of the working dir, including creation and cleanup, can easily be
> > kept aligned with the previous behavior.
> >
> > On Thu, Mar 28, 2024 at 12:07 PM Feifan Wang <zoltar9...@163.com> wrote:
> >
> > > And I think the cleanup of the working dir should be discussed in
> > > FLIP-427[1] (this mail list [2])?
> > >
> > > [1] https://cwiki.apache.org/confluence/x/T4p3EQ
> > > [2] https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ft
> > >
> > > ——————————————
> > >
> > > Best regards,
> > >
> > > Feifan Wang
> > >
> > > At 2024-03-28 11:56:22, "Feifan Wang" <zoltar9...@163.com> wrote:
> > > >Hi Jinzhong:
> > > >
> > > >> I suggest that we could postpone this topic for now and consider
> > > >> it comprehensively, combined with the TM ownership file
> > > >> management, in a future FLIP.
> > > >
> > > >Sorry, I still think we should consider the cleanup of the working
> > > >dir in this FLIP. Although we may come up with a better solution in
> > > >a subsequent FLIP, I think it is important to maintain the integrity
> > > >of the current changes; otherwise we may suffer wasted DFS space for
> > > >some time.
> > > >Perhaps we only need a simple cleanup strategy at this stage, such
> > > >as proactive cleanup when the TM exits. While this may fail in the
> > > >case of a TM crash, it already alleviates the problem.
> > > >
> > > >——————————————
> > > >
> > > >Best regards,
> > > >
> > > >Feifan Wang
> > > >
> > > >At 2024-03-28 11:15:11, "Jinzhong Li" <lijinzhong2...@gmail.com> wrote:
> > > >>Hi Yun,
> > > >>
> > > >>Thanks for your reply.
> > > >>
> > > >>> 1. Why must we have another 'subTask-checkpoint-sub-dir'
> > > >>> under the shared directory? If we don't consider making
> > > >>> TM ownership in this FLIP, this design seems unnecessary.
> > > >>
> > > >>Good catch! We will not change the directory layout of the shared
> > > >>directory in this FLIP. I have already removed this part from the
> > > >>FLIP. I think we could revisit this topic in a future FLIP about TM
> > > >>ownership.
> > > >>
> > > >>> 2. This FLIP forgets to mention the cleanup of the remote
> > > >>> working directory in case the taskmanager crashes;
> > > >>> even though this is an open problem, we can still leave
> > > >>> some space for future optimization.
> > > >>
> > > >>Considering that we have plans to merge the TM working dir and the
> > > >>checkpoint dir into one directory, I suggest that we postpone this
> > > >>topic for now and consider it comprehensively, combined with the TM
> > > >>ownership file management, in a future FLIP.
> > > >>
> > > >>Best,
> > > >>Jinzhong
> > > >>
> > > >>On Wed, Mar 27, 2024 at 11:49 PM Yun Tang <myas...@live.com> wrote:
> > > >>
> > > >>> Hi Jinzhong,
> > > >>>
> > > >>> The overall design looks good.
> > > >>>
> > > >>> I have two minor questions:
> > > >>>
> > > >>> 1. Why must we have another 'subTask-checkpoint-sub-dir' under the
> > > >>> shared directory? If we don't consider making TM ownership in this
> > > >>> FLIP, this design seems unnecessary.
> > > >>> 2. This FLIP forgets to mention the cleanup of the remote working
> > > >>> directory in case the taskmanager crashes; even though this is an
> > > >>> open problem, we can still leave some space for future
> > > >>> optimization.
> > > >>>
> > > >>> Best,
> > > >>> Yun Tang
> > > >>>
> > > >>> ________________________________
> > > >>> From: Jinzhong Li <lijinzhong2...@gmail.com>
> > > >>> Sent: Monday, March 25, 2024 10:41
> > > >>> To: dev@flink.apache.org <dev@flink.apache.org>
> > > >>> Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale
> > > >>> Integration for Disaggregated State
> > > >>>
> > > >>> Hi Yue,
> > > >>>
> > > >>> Thanks for your comments.
> > > >>>
> > > >>> The CURRENT file is a special file that points to the latest
> > > >>> manifest log file. As Zakelly explained above, we could record the
> > > >>> latest manifest filename during the sync phase and write that
> > > >>> filename into the CURRENT snapshot file during the async phase.
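> > > >>>
> > > >>> A rough sketch of that sync/async split, using Flink's FileSystem
> > > >>> API for illustration (syncPhase/asyncPhase and the field names are
> > > >>> illustrative, not the actual ForSt code):
> > > >>>
> > > >>>     import java.io.IOException;
> > > >>>     import java.io.OutputStream;
> > > >>>     import java.nio.charset.StandardCharsets;
> > > >>>     import org.apache.flink.core.fs.FSDataInputStream;
> > > >>>     import org.apache.flink.core.fs.FileSystem;
> > > >>>     import org.apache.flink.core.fs.Path;
> > > >>>
> > > >>>     String pinnedManifest; // e.g. "MANIFEST-000008"
> > > >>>
> > > >>>     // Sync phase (task is paused): CURRENT is only a few bytes
> > > >>>     // naming the live manifest, so buffering it is cheap.
> > > >>>     void syncPhase(Path dbDir) throws IOException {
> > > >>>         FileSystem fs = dbDir.getFileSystem();
> > > >>>         try (FSDataInputStream in = fs.open(new Path(dbDir, "CURRENT"))) {
> > > >>>             pinnedManifest =
> > > >>>                 new String(in.readAllBytes(), StandardCharsets.UTF_8).trim();
> > > >>>         }
> > > >>>     }
> > > >>>
> > > >>>     // Async phase: write the pinned name, never the live CURRENT
> > > >>>     // file, so a manifest roll in between cannot leak in.
> > > >>>     void asyncPhase(OutputStream checkpointStream) throws IOException {
> > > >>>         checkpointStream.write(
> > > >>>             (pinnedManifest + "\n").getBytes(StandardCharsets.UTF_8));
> > > >>>     }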
> > > >>> > > > > > >>> > > Jinzhong Li <lijinzhong2...@gmail.com> 于2024年3月20日周三 20:13写道: > > > >>> > > > > > >>> > > > Hi Yue, > > > >>> > > > > > > >>> > > > Thanks for your feedback! > > > >>> > > > > > > >>> > > > > 1. If we choose Option-3 for ForSt , how would we handle > > > Manifest > > > >>> > File > > > >>> > > > > ? Should we take a snapshot of the Manifest during the > > > >>> > synchronization > > > >>> > > > phase? > > > >>> > > > > > > >>> > > > IIUC, the GetLiveFiles() API in Option-3 can also catch the > > > fileInfo > > > >>> of > > > >>> > > > Manifest files, and this api also return the manifest file > > size, > > > >>> which > > > >>> > > > means this api could take snapshot for Manifest FileInfo > > > (filename + > > > >>> > > > fileSize) during the synchronization phase. > > > >>> > > > You could refer to the rocksdb source code[1] to verify this. > > > >>> > > > > > > >>> > > > > > > >>> > > > > However, many distributed storage systems do not support > the > > > >>> > > > > ability of Fast Duplicate (such as HDFS). But ForSt has the > > > ability > > > >>> > to > > > >>> > > > > directly read and write remote files. Can we not copy or > Fast > > > >>> > duplicate > > > >>> > > > > these files, but instand of directly reuse and. reference > > these > > > >>> > remote > > > >>> > > > > files? I think this can reduce file download time and may > be > > > more > > > >>> > > useful > > > >>> > > > > for most users who use HDFS (do not support Fast > Duplicate)? > > > >>> > > > > > > >>> > > > Firstly, as far as I know, most remote file systems support > the > > > >>> > > > FastDuplicate, eg. S3 copyObject/Azure Blob Storage > > copyBlob/OSS > > > >>> > > > copyObject, and the HDFS indeed does not support > FastDuplicate. > > > >>> > > > > > > >>> > > > Actually,we have considered the design which reuses remote > > > files. And > > > >>> > > that > > > >>> > > > is what we want to implement in the coming future, where both > > > >>> > checkpoints > > > >>> > > > and restores can reuse existing files residing on the remote > > > state > > > >>> > > storage. > > > >>> > > > However, this design conflicts with the current file > management > > > >>> system > > > >>> > in > > > >>> > > > Flink. At present, remote state files are managed by the > > ForStDB > > > >>> > > > (TaskManager side), while checkpoint files are managed by the > > > >>> > JobManager, > > > >>> > > > which is a major hindrance to file reuse. For example, issues > > > could > > > >>> > arise > > > >>> > > > if a TM reuses a checkpoint file that is subsequently deleted > > by > > > the > > > >>> > JM. > > > >>> > > > Therefore, as mentioned in FLIP-423[2], our roadmap is to > first > > > >>> > integrate > > > >>> > > > checkpoint/restore mechanisms with existing framework at > > > >>> milestone-1. > > > >>> > > > Then, at milestone-2, we plan to introduce TM State Ownership > > and > > > >>> > Faster > > > >>> > > > Checkpointing mechanisms, which will allow both checkpointing > > and > > > >>> > > restoring > > > >>> > > > to directly reuse remote files, thus achieving faster > > > checkpointing > > > >>> and > > > >>> > > > restoring. 
> > > >>> > > > > > > >>> > > > [1] > > > >>> > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > > > > https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/db/db_filesnapshot.cc#L77 > > > >>> > > > [2] > > > >>> > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-RoadMap+LaunchingPlan > > > >>> > > > > > > >>> > > > Best, > > > >>> > > > Jinzhong > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > On Wed, Mar 20, 2024 at 4:01 PM yue ma <mayuefi...@gmail.com > > > > > wrote: > > > >>> > > > > > > >>> > > > > Hi Jinzhong > > > >>> > > > > > > > >>> > > > > Thank you for initiating this FLIP. > > > >>> > > > > > > > >>> > > > > I have just some minor question: > > > >>> > > > > > > > >>> > > > > 1. If we choice Option-3 for ForSt , how would we handle > > > Manifest > > > >>> > File > > > >>> > > > > ? Should we take snapshot of the Manifest during the > > > >>> synchronization > > > >>> > > > phase? > > > >>> > > > > Otherwise, may the Manifest and MetaInfo information be > > > >>> inconsistent > > > >>> > > > during > > > >>> > > > > recovery? > > > >>> > > > > 2. For the Restore Operation , we need Fast Duplicate > > > Checkpoint > > > >>> > Files > > > >>> > > > to > > > >>> > > > > Working Dir . However, many distributed storage systems do > > not > > > >>> > support > > > >>> > > > the > > > >>> > > > > ability of Fast Duplicate (such as HDFS). But ForSt has the > > > ability > > > >>> > to > > > >>> > > > > directly read and write remote files. Can we not copy or > Fast > > > >>> > duplicate > > > >>> > > > > these files, but instand of directly reuse and. reference > > these > > > >>> > remote > > > >>> > > > > files? I think this can reduce file download time and may > be > > > more > > > >>> > > useful > > > >>> > > > > for most users who use HDFS (do not support Fast > Duplicate)? > > > >>> > > > > > > > >>> > > > > -- > > > >>> > > > > Best, > > > >>> > > > > Yue > > > >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > > >>> > > -- > > > >>> > > Best, > > > >>> > > Yue > > > >>> > > > > > >>> > > > > >>> > > > > > > > > > -- > > Best, > > Hangxiang. > > >