Hi Jinzhong,

> I suggest that we could postpone this topic for now and consider it
> comprehensively combined with the TM ownership file management in the future
> FLIP.

Sorry, I still think we should cover cleanup of the working dir in this FLIP.
Although we may come up with a better solution in a subsequent FLIP, it is
important to maintain the integrity of the current changes; otherwise we may
suffer from wasted DFS space for some time. Perhaps we only need a simple
cleanup strategy at this stage, such as proactively deleting the working dir
when the TM exits. While this may fail if the TM crashes, it already
alleviates the problem.
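To illustrate what I have in mind, a rough best-effort sketch (the class and
the wiring below are hypothetical, not the actual ForSt/Flink API; only the
Flink FileSystem calls are real):

    /** Hypothetical hook invoked when a TaskManager shuts down gracefully. */
    public class WorkingDirCleanupHook implements AutoCloseable {

        private final org.apache.flink.core.fs.Path remoteWorkingDir;

        public WorkingDirCleanupHook(org.apache.flink.core.fs.Path remoteWorkingDir) {
            this.remoteWorkingDir = remoteWorkingDir;
        }

        @Override
        public void close() throws Exception {
            final org.apache.flink.core.fs.FileSystem fs = remoteWorkingDir.getFileSystem();
            // Best-effort: recursively delete this TM's remote working directory.
            // A crashed TM never reaches this point, so leaked directories still
            // need a follow-up story in the TM-ownership FLIP.
            if (fs.exists(remoteWorkingDir)) {
                fs.delete(remoteWorkingDir, true);
            }
        }
    }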
——————————————
Best regards,
Feifan Wang

At 2024-03-28 11:15:11, "Jinzhong Li" <lijinzhong2...@gmail.com> wrote:
>Hi Yun,
>
>Thanks for your reply.
>
>> 1. Why must we have another 'subTask-checkpoint-sub-dir'
>> under the shared directory? If we don't consider making
>> TM ownership in this FLIP, this design seems unnecessary.
>
>Good catch! We will not change the directory layout of the shared directory
>in this FLIP; I have already removed this part from the FLIP. I think we
>could revisit this topic in a future FLIP about TM ownership.
>
>> 2. This FLIP forgets to mention the cleanup of the remote
>> working directory in case the taskmanager crashes. Even
>> though this is an open problem, we can still leave some
>> space for future optimization.
>
>Considering that we plan to merge the TM working dir and the checkpoint dir
>into one directory, I suggest that we postpone this topic for now and
>consider it comprehensively, combined with the TM ownership file management,
>in the future FLIP.
>
>Best,
>Jinzhong
>
>On Wed, Mar 27, 2024 at 11:49 PM Yun Tang <myas...@live.com> wrote:
>
>> Hi Jinzhong,
>>
>> The overall design looks good.
>>
>> I have two minor questions:
>>
>> 1. Why must we have another 'subTask-checkpoint-sub-dir' under the shared
>> directory? If we don't consider making TM ownership in this FLIP, this
>> design seems unnecessary.
>> 2. This FLIP forgets to mention the cleanup of the remote working
>> directory in case the taskmanager crashes. Even though this is an open
>> problem, we can still leave some space for future optimization.
>>
>> Best,
>> Yun Tang
>>
>> ________________________________
>> From: Jinzhong Li <lijinzhong2...@gmail.com>
>> Sent: Monday, March 25, 2024 10:41
>> To: dev@flink.apache.org <dev@flink.apache.org>
>> Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for
>> Disaggregated State
>>
>> Hi Yue,
>>
>> Thanks for your comments.
>>
>> The CURRENT file is a special file that points to the latest manifest log
>> file. As Zakelly explained above, we could record the latest manifest
>> filename during the sync phase and write that filename into the CURRENT
>> file of the snapshot during the async phase.
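>>
>> For illustration only, the idea in rough pseudo-Java (names such as
>> readCurrentFile, checkpointFs and checkpointCurrentPath are made up, not
>> the actual ForSt code; only the Flink FileSystem calls are real):
>>
>>     // Sync phase (under the snapshot lock): remember which manifest the
>>     // CURRENT file points to right now, e.g. "MANIFEST-000008". The file
>>     // is tiny, so it is simply loaded into memory.
>>     final String manifestName = readCurrentFile(dbPath);
>>
>>     // Async phase: write the captured name as the CURRENT entry of the
>>     // checkpoint, instead of re-reading the live CURRENT file, which may
>>     // already point to a newer manifest (e.g. MANIFEST-000009) by now.
>>     try (FSDataOutputStream out =
>>             checkpointFs.create(checkpointCurrentPath, WriteMode.NO_OVERWRITE)) {
>>         out.write((manifestName + "\n").getBytes(StandardCharsets.UTF_8));
>>     }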
>>
>> Best,
>> Jinzhong
>>
>> On Fri, Mar 22, 2024 at 11:16 PM Zakelly Lan <zakelly....@gmail.com>
>> wrote:
>>
>> > Hi Yue,
>> >
>> > Thanks for bringing this up!
>> >
>> > The CURRENT file is the special one, which should be snapshot during the
>> > sync phase (temporarily loaded into memory). Thus we can solve this.
>> >
>> > Best,
>> > Zakelly
>> >
>> > On Fri, Mar 22, 2024 at 4:55 PM yue ma <mayuefi...@gmail.com> wrote:
>> >
>> > > Hi Jinzhong,
>> > > Thanks for your reply. I still have some doubts about the first
>> > > question. Is there such a case: when you make a snapshot during the
>> > > synchronization phase, you record the CURRENT file and manifest 8, but
>> > > before the asynchronous phase the manifest reaches the size threshold,
>> > > the CURRENT file is switched to the new manifest 9, and then the
>> > > incorrect CURRENT file is uploaded?
>> > >
>> > > On Wed, Mar 20, 2024 at 20:13, Jinzhong Li <lijinzhong2...@gmail.com> wrote:
>> > >
>> > > > Hi Yue,
>> > > >
>> > > > Thanks for your feedback!
>> > > >
>> > > > > 1. If we choose Option-3 for ForSt, how would we handle the
>> > > > > Manifest file? Should we take a snapshot of the Manifest during
>> > > > > the synchronization phase?
>> > > >
>> > > > IIUC, the GetLiveFiles() API in Option-3 also catches the fileInfo of
>> > > > Manifest files, and it also returns the manifest file size, which
>> > > > means this API can take a snapshot of the Manifest FileInfo
>> > > > (filename + fileSize) during the synchronization phase.
>> > > > You could refer to the rocksdb source code[1] to verify this.
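>> > > >
>> > > > For example, with the plain RocksDB Java API this looks roughly as
>> > > > follows (ForSt should expose something equivalent; treat this as an
>> > > > illustrative sketch, not the actual ForSt code):
>> > > >
>> > > >     // Sync phase: "db" is the already-open RocksDB instance.
>> > > >     // flushMemtable=false, since we only want a consistent list of
>> > > >     // the live files, not a flush. (Throws RocksDBException.)
>> > > >     final RocksDB.LiveFiles liveFiles = db.getLiveFiles(false);
>> > > >
>> > > >     // "files" holds the relative names of the SST files plus CURRENT,
>> > > >     // OPTIONS and MANIFEST-xxxxxx; "manifestFileSize" is the size up
>> > > >     // to which the manifest is valid at this point in time.
>> > > >     final long manifestSize = liveFiles.manifestFileSize;
>> > > >     for (final String relativeName : liveFiles.files) {
>> > > >         // record (relativeName, size) in the snapshot's file list
>> > > >     }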
>> > > >
>> > > > > However, many distributed storage systems do not support the
>> > > > > ability of Fast Duplicate (such as HDFS). But ForSt has the ability
>> > > > > to directly read and write remote files. Could we not copy or fast
>> > > > > duplicate these files, but instead directly reuse and reference
>> > > > > these remote files? I think this can reduce file download time and
>> > > > > may be more useful for most users who use HDFS (which does not
>> > > > > support Fast Duplicate).
>> > > >
>> > > > Firstly, as far as I know, most remote file systems do support
>> > > > FastDuplicate, e.g. S3 copyObject / Azure Blob Storage copyBlob /
>> > > > OSS copyObject; HDFS indeed does not support FastDuplicate.
>> > > >
>> > > > Actually, we have considered the design which reuses remote files,
>> > > > and that is what we want to implement in the near future, where both
>> > > > checkpoints and restores can reuse existing files residing on the
>> > > > remote state storage. However, this design conflicts with the current
>> > > > file management system in Flink. At present, remote state files are
>> > > > managed by ForStDB (TaskManager side), while checkpoint files are
>> > > > managed by the JobManager, which is a major hindrance to file reuse.
>> > > > For example, issues could arise if a TM reuses a checkpoint file that
>> > > > is subsequently deleted by the JM. Therefore, as mentioned in
>> > > > FLIP-423[2], our roadmap is to first integrate the checkpoint/restore
>> > > > mechanisms with the existing framework at milestone-1. Then, at
>> > > > milestone-2, we plan to introduce TM State Ownership and Faster
>> > > > Checkpointing mechanisms, which will allow both checkpointing and
>> > > > restoring to directly reuse remote files, thus achieving faster
>> > > > checkpointing and restoring.
>> > > >
>> > > > [1]
>> > > > https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/db/db_filesnapshot.cc#L77
>> > > > [2]
>> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-RoadMap+LaunchingPlan
>> > > >
>> > > > Best,
>> > > > Jinzhong
>> > > >
>> > > > On Wed, Mar 20, 2024 at 4:01 PM yue ma <mayuefi...@gmail.com> wrote:
>> > > >
>> > > > > Hi Jinzhong,
>> > > > >
>> > > > > Thank you for initiating this FLIP.
>> > > > >
>> > > > > I have just some minor questions:
>> > > > >
>> > > > > 1. If we choose Option-3 for ForSt, how would we handle the Manifest
>> > > > > file? Should we take a snapshot of the Manifest during the
>> > > > > synchronization phase? Otherwise, might the Manifest and MetaInfo
>> > > > > information be inconsistent during recovery?
>> > > > > 2. For the Restore operation, we need to fast-duplicate checkpoint
>> > > > > files to the working dir. However, many distributed storage systems
>> > > > > do not support the ability of Fast Duplicate (such as HDFS). But
>> > > > > ForSt has the ability to directly read and write remote files.
>> > > > > Could we not copy or fast duplicate these files, but instead
>> > > > > directly reuse and reference these remote files? I think this can
>> > > > > reduce file download time and may be more useful for most users who
>> > > > > use HDFS (which does not support Fast Duplicate)?
>> > > > >
>> > > > > --
>> > > > > Best,
>> > > > > Yue
>> > > >
>> > >
>> > > --
>> > > Best,
>> > > Yue
>> >
>>