Hi Yun,

Thanks for your advice. I have added the topic of remote working files
cleanup to the current FLIP.
Best,
Jinzhong

On Sat, Mar 30, 2024 at 10:44 AM Yun Tang <myas...@live.com> wrote:

> Hi Jinzhong,
>
> Yes, I know the cleanup mechanism for the remote working directory is the
> same as the current RocksDB state backend's. However, the impact of
> residual files in the remote working directory is different from that of
> residual files in the local directory, especially since Flink only makes
> a best effort to clean up during StateBackend#dispose.
>
> I agree that we could leave the optimization to a future FLIP; however, I
> think we should mention this topic in the current FLIP to make the
> overall design more complete and polished.
>
> Best
> Yun Tang
> ________________________________
> From: Jinzhong Li <lijinzhong2...@gmail.com>
> Sent: Thursday, March 28, 2024 12:45
> To: dev@flink.apache.org <dev@flink.apache.org>
> Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for
> Disaggregated State
>
> Hi Feifan,
>
> Sorry for the misunderstanding. As Hangxiang explained, the basic cleanup
> mechanism for the remote working directory is the same as the RocksDB
> state backend's: when the TM exits, the ForSt state backend deletes the
> entire working dir. Regarding orphaned-file cleanup in the case of a TM
> crash, we will address it in a future FLIP.
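>
> A minimal sketch of what that best-effort cleanup amounts to, using
> Flink's FileSystem API (the disposeWorkingDir method and workingDir
> argument are illustrative names, not the actual ForSt code):
>
>     import java.io.IOException;
>     import org.apache.flink.core.fs.FileSystem;
>     import org.apache.flink.core.fs.Path;
>
>     /** Best-effort removal of the TM's remote working dir on disposal. */
>     static void disposeWorkingDir(Path workingDir) {
>         try {
>             FileSystem fs = workingDir.getFileSystem();
>             fs.delete(workingDir, true); // recursive delete of the dir
>         } catch (IOException e) {
>             // Best effort only: if this fails, or the TM crashes before
>             // disposal runs, the files stay orphaned until a future FLIP
>             // adds a reclamation mechanism.
>         }
>     }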
>
> Best,
> Jinzhong
>
> On Thu, Mar 28, 2024 at 12:35 PM Hangxiang Yu <master...@gmail.com> wrote:
>
> > Hi, Yun and Feifan.
> >
> > Thanks for your reply.
> >
> > About the cleanup of the working dir: as mentioned in FLIP-427, "The
> > life cycle of working dir is managed as before local strategy."
> > Since the current working dir and checkpoint dir are separate, the life
> > cycle of the working dir, including creation and cleanup, can easily be
> > kept aligned with the previous behavior.
> >
> > On Thu, Mar 28, 2024 at 12:07 PM Feifan Wang <zoltar9...@163.com> wrote:
> >
> > > And I think the cleanup of the working dir should be discussed in
> > > FLIP-427[1] (this mail list [2])?
> > >
> > > [1] https://cwiki.apache.org/confluence/x/T4p3EQ
> > > [2] https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ft
> > >
> > > ——————————————
> > >
> > > Best regards,
> > >
> > > Feifan Wang
> > >
> > > At 2024-03-28 11:56:22, "Feifan Wang" <zoltar9...@163.com> wrote:
> > > >Hi Jinzhong:
> > > >
> > > >> I suggest that we could postpone this topic for now and consider
> > > >> it comprehensively, combined with the TM ownership file
> > > >> management, in a future FLIP.
> > > >
> > > >Sorry, I still think we should consider the cleanup of the working
> > > >dir in this FLIP. Although we may come up with a better solution in
> > > >a subsequent FLIP, I think it is important to maintain the integrity
> > > >of the current changes; otherwise we may suffer wasted DFS space for
> > > >some time.
> > > >Perhaps we only need a simple cleanup strategy at this stage, such
> > > >as proactive cleanup when the TM exits. While this may fail in the
> > > >case of a TM crash, it already alleviates the problem.
> > > >
> > > >——————————————
> > > >
> > > >Best regards,
> > > >
> > > >Feifan Wang
> > > >
> > > >At 2024-03-28 11:15:11, "Jinzhong Li" <lijinzhong2...@gmail.com> wrote:
> > > >>Hi Yun,
> > > >>
> > > >>Thanks for your reply.
> > > >>
> > > >>> 1. Why must we have another 'subTask-checkpoint-sub-dir'
> > > >>> under the shared directory? If we don't consider making
> > > >>> TM ownership in this FLIP, this design seems unnecessary.
> > > >>
> > > >>Good catch! We will not change the directory layout of the shared
> > > >>directory in this FLIP. I have already removed this part from the
> > > >>FLIP. I think we could revisit this topic in a future FLIP about TM
> > > >>ownership.
> > > >>
> > > >>> 2. This FLIP forgets to mention the cleanup of the remote
> > > >>> working directory in case the taskmanager crashes;
> > > >>> even though this is an open problem, we can still leave
> > > >>> some space for future optimization.
> > > >>
> > > >>Considering that we have plans to merge the TM working dir and the
> > > >>checkpoint dir into one directory, I suggest that we postpone this
> > > >>topic for now and consider it comprehensively, combined with the TM
> > > >>ownership file management, in a future FLIP.
> > > >>
> > > >>Best,
> > > >>Jinzhong
> > > >>
> > > >>On Wed, Mar 27, 2024 at 11:49 PM Yun Tang <myas...@live.com> wrote:
> > > >>
> > > >>> Hi Jinzhong,
> > > >>>
> > > >>> The overall design looks good.
> > > >>>
> > > >>> I have two minor questions:
> > > >>>
> > > >>> 1. Why must we have another 'subTask-checkpoint-sub-dir' under the
> > > >>> shared directory? If we don't consider making TM ownership in this
> > > >>> FLIP, this design seems unnecessary.
> > > >>> 2. This FLIP forgets to mention the cleanup of the remote working
> > > >>> directory in case the taskmanager crashes; even though this is an
> > > >>> open problem, we can still leave some space for future
> > > >>> optimization.
> > > >>>
> > > >>> Best,
> > > >>> Yun Tang
> > > >>>
> > > >>> ________________________________
> > > >>> From: Jinzhong Li <lijinzhong2...@gmail.com>
> > > >>> Sent: Monday, March 25, 2024 10:41
> > > >>> To: dev@flink.apache.org <dev@flink.apache.org>
> > > >>> Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale
> > > >>> Integration for Disaggregated State
> > > >>>
> > > >>> Hi Yue,
> > > >>>
> > > >>> Thanks for your comments.
> > > >>>
> > > >>> The CURRENT file is a special file that points to the latest
> > > >>> manifest log file. As Zakelly explained above, we could record the
> > > >>> latest manifest filename during the sync phase and write that
> > > >>> filename into the CURRENT snapshot file during the async phase.
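> > > >>>
> > > >>> A rough sketch of that sync/async split, using Flink's FileSystem
> > > >>> API for illustration (syncPhase/asyncPhase and the field names are
> > > >>> illustrative, not the actual ForSt code):
> > > >>>
> > > >>>     import java.io.IOException;
> > > >>>     import java.io.OutputStream;
> > > >>>     import java.nio.charset.StandardCharsets;
> > > >>>     import org.apache.flink.core.fs.FSDataInputStream;
> > > >>>     import org.apache.flink.core.fs.FileSystem;
> > > >>>     import org.apache.flink.core.fs.Path;
> > > >>>
> > > >>>     String pinnedManifest; // e.g. "MANIFEST-000008"
> > > >>>
> > > >>>     // Sync phase (task is paused): CURRENT is only a few bytes
> > > >>>     // naming the live manifest, so buffering it is cheap.
> > > >>>     void syncPhase(Path dbDir) throws IOException {
> > > >>>         FileSystem fs = dbDir.getFileSystem();
> > > >>>         try (FSDataInputStream in = fs.open(new Path(dbDir, "CURRENT"))) {
> > > >>>             pinnedManifest =
> > > >>>                 new String(in.readAllBytes(), StandardCharsets.UTF_8).trim();
> > > >>>         }
> > > >>>     }
> > > >>>
> > > >>>     // Async phase: write the pinned name, never the live CURRENT
> > > >>>     // file, so a manifest roll in between cannot leak in.
> > > >>>     void asyncPhase(OutputStream checkpointStream) throws IOException {
> > > >>>         checkpointStream.write(
> > > >>>             (pinnedManifest + "\n").getBytes(StandardCharsets.UTF_8));
> > > >>>     }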
> > > >>> > > > > > >>> > > Jinzhong Li <lijinzhong2...@gmail.com> 于2024年3月20日周三 20:13写道: > > > >>> > > > > > >>> > > > Hi Yue, > > > >>> > > > > > > >>> > > > Thanks for your feedback! > > > >>> > > > > > > >>> > > > > 1. If we choose Option-3 for ForSt , how would we handle > > > Manifest > > > >>> > File > > > >>> > > > > ? Should we take a snapshot of the Manifest during the > > > >>> > synchronization > > > >>> > > > phase? > > > >>> > > > > > > >>> > > > IIUC, the GetLiveFiles() API in Option-3 can also catch the > > > fileInfo > > > >>> of > > > >>> > > > Manifest files, and this api also return the manifest file > > size, > > > >>> which > > > >>> > > > means this api could take snapshot for Manifest FileInfo > > > (filename + > > > >>> > > > fileSize) during the synchronization phase. > > > >>> > > > You could refer to the rocksdb source code[1] to verify this. > > > >>> > > > > > > >>> > > > > > > >>> > > > > However, many distributed storage systems do not support > the > > > >>> > > > > ability of Fast Duplicate (such as HDFS). But ForSt has the > > > ability > > > >>> > to > > > >>> > > > > directly read and write remote files. Can we not copy or > Fast > > > >>> > duplicate > > > >>> > > > > these files, but instand of directly reuse and. reference > > these > > > >>> > remote > > > >>> > > > > files? I think this can reduce file download time and may > be > > > more > > > >>> > > useful > > > >>> > > > > for most users who use HDFS (do not support Fast > Duplicate)? > > > >>> > > > > > > >>> > > > Firstly, as far as I know, most remote file systems support > the > > > >>> > > > FastDuplicate, eg. S3 copyObject/Azure Blob Storage > > copyBlob/OSS > > > >>> > > > copyObject, and the HDFS indeed does not support > FastDuplicate. > > > >>> > > > > > > >>> > > > Actually,we have considered the design which reuses remote > > > files. And > > > >>> > > that > > > >>> > > > is what we want to implement in the coming future, where both > > > >>> > checkpoints > > > >>> > > > and restores can reuse existing files residing on the remote > > > state > > > >>> > > storage. > > > >>> > > > However, this design conflicts with the current file > management > > > >>> system > > > >>> > in > > > >>> > > > Flink. At present, remote state files are managed by the > > ForStDB > > > >>> > > > (TaskManager side), while checkpoint files are managed by the > > > >>> > JobManager, > > > >>> > > > which is a major hindrance to file reuse. For example, issues > > > could > > > >>> > arise > > > >>> > > > if a TM reuses a checkpoint file that is subsequently deleted > > by > > > the > > > >>> > JM. > > > >>> > > > Therefore, as mentioned in FLIP-423[2], our roadmap is to > first > > > >>> > integrate > > > >>> > > > checkpoint/restore mechanisms with existing framework at > > > >>> milestone-1. > > > >>> > > > Then, at milestone-2, we plan to introduce TM State Ownership > > and > > > >>> > Faster > > > >>> > > > Checkpointing mechanisms, which will allow both checkpointing > > and > > > >>> > > restoring > > > >>> > > > to directly reuse remote files, thus achieving faster > > > checkpointing > > > >>> and > > > >>> > > > restoring. 
> > > >>> > > > > > > >>> > > > [1] > > > >>> > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > > > > https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/db/db_filesnapshot.cc#L77 > > > >>> > > > [2] > > > >>> > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-RoadMap+LaunchingPlan > > > >>> > > > > > > >>> > > > Best, > > > >>> > > > Jinzhong > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > On Wed, Mar 20, 2024 at 4:01 PM yue ma <mayuefi...@gmail.com > > > > > wrote: > > > >>> > > > > > > >>> > > > > Hi Jinzhong > > > >>> > > > > > > > >>> > > > > Thank you for initiating this FLIP. > > > >>> > > > > > > > >>> > > > > I have just some minor question: > > > >>> > > > > > > > >>> > > > > 1. If we choice Option-3 for ForSt , how would we handle > > > Manifest > > > >>> > File > > > >>> > > > > ? Should we take snapshot of the Manifest during the > > > >>> synchronization > > > >>> > > > phase? > > > >>> > > > > Otherwise, may the Manifest and MetaInfo information be > > > >>> inconsistent > > > >>> > > > during > > > >>> > > > > recovery? > > > >>> > > > > 2. For the Restore Operation , we need Fast Duplicate > > > Checkpoint > > > >>> > Files > > > >>> > > > to > > > >>> > > > > Working Dir . However, many distributed storage systems do > > not > > > >>> > support > > > >>> > > > the > > > >>> > > > > ability of Fast Duplicate (such as HDFS). But ForSt has the > > > ability > > > >>> > to > > > >>> > > > > directly read and write remote files. Can we not copy or > Fast > > > >>> > duplicate > > > >>> > > > > these files, but instand of directly reuse and. reference > > these > > > >>> > remote > > > >>> > > > > files? I think this can reduce file download time and may > be > > > more > > > >>> > > useful > > > >>> > > > > for most users who use HDFS (do not support Fast > Duplicate)? > > > >>> > > > > > > > >>> > > > > -- > > > >>> > > > > Best, > > > >>> > > > > Yue > > > >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > > >>> > > -- > > > >>> > > Best, > > > >>> > > Yue > > > >>> > > > > > >>> > > > > >>> > > > > > > > > > -- > > Best, > > Hangxiang. > > >