And I think the cleanup of the working dir should be discussed in FLIP-427 [1] (this mailing list [2])?
[1] https://cwiki.apache.org/confluence/x/T4p3EQ
[2] https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ft

——————————————
Best regards,
Feifan Wang


At 2024-03-28 11:56:22, "Feifan Wang" <zoltar9...@163.com> wrote:
>Hi Jinzhong:
>
>> I suggest that we could postpone this topic for now and consider it
>> comprehensively combined with the TM ownership file management in the
>> future FLIP.
>
>Sorry, I still think we should consider the cleanup of the working dir in
>this FLIP. Although we may come up with a better solution in a subsequent
>FLIP, I think it is important to maintain the integrity of the current
>changes. Otherwise we may suffer from wasted DFS space for some time.
>Perhaps we only need a simple cleanup strategy at this stage, such as
>proactive cleanup when the TM exits. While this may fail in the case of a
>TM crash, it already alleviates the problem.
>
>——————————————
>Best regards,
>Feifan Wang
>
>At 2024-03-28 11:15:11, "Jinzhong Li" <lijinzhong2...@gmail.com> wrote:
>>Hi Yun,
>>
>>Thanks for your reply.
>>
>>> 1. Why must we have another 'subTask-checkpoint-sub-dir'
>>> under the shared directory? If we don't consider making
>>> TM ownership in this FLIP, this design seems unnecessary.
>>
>>Good catch! We will not change the directory layout of the shared
>>directory in this FLIP. I have already removed this part from the FLIP.
>>I think we could revisit this topic in a future FLIP about TM ownership.
>>
>>> 2. This FLIP forgets to mention the cleanup of the remote
>>> working directory in case the taskmanager crashes. Even
>>> though this is an open problem, we can still leave
>>> some space for future optimization.
>>
>>Considering that we have plans to merge the TM working dir and checkpoint
>>dir into one directory, I suggest that we could postpone this topic for
>>now and consider it comprehensively combined with the TM ownership file
>>management in the future FLIP.
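The "proactive cleanup when the TM exits" idea proposed above can be illustrated with a minimal sketch. This is a hypothetical simulation, not Flink's actual TaskManager shutdown path; the directory and file names are made up:

```python
import atexit
import shutil
import tempfile
from pathlib import Path

# Hypothetical stand-in for a TaskManager's working directory.
working_dir = Path(tempfile.mkdtemp(prefix="tm-working-dir-"))
(working_dir / "dummy-state-file.sst").write_text("dummy state")

def cleanup_working_dir() -> None:
    # Best-effort removal on normal process exit. A hard crash
    # (SIGKILL, OOM kill) never runs this hook, which is exactly the
    # failure case acknowledged in the proposal above.
    shutil.rmtree(working_dir, ignore_errors=True)

# Registered for normal exit; invoked directly here to show the effect.
atexit.register(cleanup_working_dir)
cleanup_working_dir()
print(working_dir.exists())
```

As the mail notes, this only alleviates the problem: crash cases still need a separate reconciliation or garbage-collection pass.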
>>
>>Best,
>>Jinzhong
>>
>>
>>On Wed, Mar 27, 2024 at 11:49 PM Yun Tang <myas...@live.com> wrote:
>>
>>> Hi Jinzhong,
>>>
>>> The overall design looks good.
>>>
>>> I have two minor questions:
>>>
>>> 1. Why must we have another 'subTask-checkpoint-sub-dir' under the
>>> shared directory? If we don't consider making TM ownership in this
>>> FLIP, this design seems unnecessary.
>>> 2. This FLIP forgets to mention the cleanup of the remote working
>>> directory in case the taskmanager crashes. Even though this is an open
>>> problem, we can still leave some space for future optimization.
>>>
>>> Best,
>>> Yun Tang
>>>
>>> ________________________________
>>> From: Jinzhong Li <lijinzhong2...@gmail.com>
>>> Sent: Monday, March 25, 2024 10:41
>>> To: dev@flink.apache.org <dev@flink.apache.org>
>>> Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration
>>> for Disaggregated State
>>>
>>> Hi Yue,
>>>
>>> Thanks for your comments.
>>>
>>> The CURRENT is a special file that points to the latest manifest log
>>> file. As Zakelly explained above, we could record the latest manifest
>>> filename during the sync phase, and write that filename into the
>>> CURRENT snapshot file during the async phase.
>>>
>>> Best,
>>> Jinzhong
>>>
>>> On Fri, Mar 22, 2024 at 11:16 PM Zakelly Lan <zakelly....@gmail.com>
>>> wrote:
>>>
>>> > Hi Yue,
>>> >
>>> > Thanks for bringing this up!
>>> >
>>> > The CURRENT file is the special one, which should be snapshotted
>>> > during the sync phase (temporarily loaded into memory). Thus we can
>>> > solve this.
>>> >
>>> > Best,
>>> > Zakelly
>>> >
>>> > On Fri, Mar 22, 2024 at 4:55 PM yue ma <mayuefi...@gmail.com> wrote:
>>> >
>>> > > Hi Jinzhong,
>>> > > Thanks for your reply. I still have some doubts about the first
>>> > > question.
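The sync/async split described above for the CURRENT file can be sketched as a tiny simulation (the names and phase functions here are hypothetical, not Flink/ForSt APIs): capture CURRENT's target in memory during the sync phase, and during the async phase upload the captured value rather than re-reading the live file, so a concurrent manifest rollover cannot corrupt the snapshot.

```python
# Tiny simulation of snapshotting CURRENT in the sync phase.

db_current = "MANIFEST-000008"  # what CURRENT points to at snapshot time

def sync_phase() -> str:
    # Sync phase (runs under the checkpoint lock): read CURRENT's
    # content into memory.
    return db_current

def async_phase(recorded_current: str) -> str:
    # Async phase (runs later, unlocked): upload the *recorded* value,
    # never re-reading the live CURRENT file.
    return f"uploaded CURRENT -> {recorded_current}"

recorded = sync_phase()
db_current = "MANIFEST-000009"  # manifest rolls over after the sync phase
print(async_phase(recorded))    # the snapshot still references MANIFEST-000008
```

This is exactly the race yue ma asks about: because the value uploaded in the async phase was pinned during the sync phase, the later rollover to manifest 9 does not leak into the checkpoint.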
>>> > > Is there such a case: when you take a snapshot during the
>>> > > synchronization phase, you record CURRENT and manifest 8, but before
>>> > > the asynchronous phase the manifest reaches the size threshold, the
>>> > > CURRENT file points to the new manifest 9, and then the incorrect
>>> > > CURRENT file is uploaded?
>>> > >
>>> > > Jinzhong Li <lijinzhong2...@gmail.com> wrote on Wed, Mar 20, 2024
>>> > > at 20:13:
>>> > >
>>> > > > Hi Yue,
>>> > > >
>>> > > > Thanks for your feedback!
>>> > > >
>>> > > > > 1. If we choose Option-3 for ForSt, how would we handle the
>>> > > > > Manifest file? Should we take a snapshot of the Manifest during
>>> > > > > the synchronization phase?
>>> > > >
>>> > > > IIUC, the GetLiveFiles() API in Option-3 can also catch the
>>> > > > fileInfo of Manifest files, and this API also returns the manifest
>>> > > > file size, which means this API could take a snapshot of the
>>> > > > Manifest FileInfo (filename + fileSize) during the synchronization
>>> > > > phase. You could refer to the RocksDB source code [1] to verify
>>> > > > this.
>>> > > >
>>> > > > > However, many distributed storage systems do not support the
>>> > > > > ability of Fast Duplicate (such as HDFS). But ForSt has the
>>> > > > > ability to directly read and write remote files. Can we not copy
>>> > > > > or fast-duplicate these files, but instead directly reuse and
>>> > > > > reference these remote files? I think this can reduce file
>>> > > > > download time and may be more useful for most users who use HDFS
>>> > > > > (which does not support Fast Duplicate)?
>>> > > >
>>> > > > Firstly, as far as I know, most remote file systems support
>>> > > > FastDuplicate, e.g. S3 copyObject / Azure Blob Storage copyBlob /
>>> > > > OSS copyObject, while HDFS indeed does not support FastDuplicate.
>>> > > >
>>> > > > Actually, we have considered the design which reuses remote files,
>>> > > > and that is what we want to implement in the coming future, where
>>> > > > both checkpoints and restores can reuse existing files residing on
>>> > > > the remote state storage.
>>> > > > However, this design conflicts with the current file management
>>> > > > system in Flink. At present, remote state files are managed by
>>> > > > ForStDB (TaskManager side), while checkpoint files are managed by
>>> > > > the JobManager, which is a major hindrance to file reuse. For
>>> > > > example, issues could arise if a TM reuses a checkpoint file that
>>> > > > is subsequently deleted by the JM. Therefore, as mentioned in
>>> > > > FLIP-423 [2], our roadmap is to first integrate checkpoint/restore
>>> > > > mechanisms with the existing framework at milestone-1. Then, at
>>> > > > milestone-2, we plan to introduce TM State Ownership and Faster
>>> > > > Checkpointing mechanisms, which will allow both checkpointing and
>>> > > > restoring to directly reuse remote files, thus achieving faster
>>> > > > checkpointing and restoring.
>>> > > >
>>> > > > [1]
>>> > > > https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/db/db_filesnapshot.cc#L77
>>> > > > [2]
>>> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-RoadMap+LaunchingPlan
>>> > > >
>>> > > > Best,
>>> > > > Jinzhong
>>> > > >
>>> > > > On Wed, Mar 20, 2024 at 4:01 PM yue ma <mayuefi...@gmail.com> wrote:
>>> > > >
>>> > > > > Hi Jinzhong,
>>> > > > >
>>> > > > > Thank you for initiating this FLIP.
>>> > > > >
>>> > > > > I have just some minor questions:
>>> > > > >
>>> > > > > 1. If we choose Option-3 for ForSt, how would we handle the
>>> > > > > Manifest file?
>>> > > > > Should we take a snapshot of the Manifest during the
>>> > > > > synchronization phase? Otherwise, may the Manifest and MetaInfo
>>> > > > > information be inconsistent during recovery?
>>> > > > > 2. For the Restore operation, we need to fast-duplicate
>>> > > > > checkpoint files to the working dir. However, many distributed
>>> > > > > storage systems do not support the ability of Fast Duplicate
>>> > > > > (such as HDFS). But ForSt has the ability to directly read and
>>> > > > > write remote files. Can we not copy or fast-duplicate these
>>> > > > > files, but instead directly reuse and reference these remote
>>> > > > > files? I think this can reduce file download time and may be
>>> > > > > more useful for most users who use HDFS (which does not support
>>> > > > > Fast Duplicate)?
>>> > > > >
>>> > > > > --
>>> > > > > Best,
>>> > > > > Yue
>>> > > > >
>>> > > >
>>> > >
>>> > >
>>> > > --
>>> > > Best,
>>> > > Yue
>>> > >
>>> >
>>>
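Jinzhong's GetLiveFiles() point earlier in the thread, pinning each live file's name together with its size during the sync phase, can be sketched as follows. This is a simulation with hypothetical names, not the real RocksDB binding: the recorded size acts as a truncation point, so bytes the running DB appends to the MANIFEST after the sync phase never leak into the checkpoint.

```python
# Simulation of pinning (filename, size) pairs in the sync phase.

live_files = {"MANIFEST-000008": 4096, "000012.sst": 1 << 20}

def sync_phase_snapshot(files: dict[str, int]) -> list[tuple[str, int]]:
    # Record each live file together with its current size; the size is
    # the upload cut-off for append-only files like the MANIFEST.
    return sorted(files.items())

snapshot = sync_phase_snapshot(live_files)
live_files["MANIFEST-000008"] += 512  # the running DB keeps appending

def async_phase_upload(snapshot: list[tuple[str, int]]) -> list[str]:
    # Upload exactly the recorded byte range of each file.
    return [f"upload {name}[0:{size}]" for name, size in snapshot]

print(async_phase_upload(snapshot))
```

Combined with the in-memory CURRENT snapshot discussed above, this is enough to keep the Manifest and MetaInfo consistent at recovery time without blocking the DB during the async phase.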