And I think the cleanup of the working dir should be discussed in FLIP-427 [1] (this mailing list [2])?
[1] https://cwiki.apache.org/confluence/x/T4p3EQ
[2] https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ft

——————————————
Best regards,
Feifan Wang


At 2024-03-28 11:56:22, "Feifan Wang" <zoltar9...@163.com> wrote:
>Hi Jinzhong:
>
>> I suggest that we could postpone this topic for now and consider it
>> comprehensively combined with the TM ownership file management in the
>> future FLIP.
>
>Sorry, I still think we should consider the cleanup of the working dir in
>this FLIP. Although we may come up with a better solution in a subsequent
>FLIP, I think it is important to maintain the integrity of the current
>changes. Otherwise we may suffer from wasted DFS space for some time.
>Perhaps we only need a simple cleanup strategy at this stage, such as
>proactive cleanup when the TM exits. While this may fail in the case of a
>TM crash, it already alleviates the problem.
>
>——————————————
>Best regards,
>Feifan Wang
>
>At 2024-03-28 11:15:11, "Jinzhong Li" <lijinzhong2...@gmail.com> wrote:
>>Hi Yun,
>>
>>Thanks for your reply.
>>
>>> 1. Why must we have another 'subTask-checkpoint-sub-dir'
>>> under the shared directory? If we don't consider making
>>> TM ownership in this FLIP, this design seems unnecessary.
>>
>>Good catch! We will not change the directory layout of the shared
>>directory in this FLIP. I have already removed this part from the FLIP.
>>I think we could revisit this topic in a future FLIP about TM ownership.
>>
>>> 2. This FLIP forgets to mention the cleanup of the remote
>>> working directory in case the taskmanager crashes. Even
>>> though this is an open problem, we can still leave
>>> some space for future optimization.
>>
>>Considering that we have plans to merge the TM working dir and checkpoint
>>dir into one directory, I suggest that we could postpone this topic for
>>now and consider it comprehensively combined with the TM ownership file
>>management in the future FLIP.
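The "proactive cleanup when the TM exits" idea proposed above can be illustrated with a minimal sketch. This is a hypothetical simulation, not Flink's actual TaskManager shutdown path; the directory and file names are made up:

```python
import atexit
import shutil
import tempfile
from pathlib import Path

# Hypothetical stand-in for a TaskManager's working directory.
working_dir = Path(tempfile.mkdtemp(prefix="tm-working-dir-"))
(working_dir / "dummy-state-file.sst").write_text("dummy state")

def cleanup_working_dir() -> None:
    # Best-effort removal on normal process exit. A hard crash
    # (SIGKILL, OOM kill) never runs this hook, which is exactly the
    # failure case acknowledged in the proposal above.
    shutil.rmtree(working_dir, ignore_errors=True)

# Registered for normal exit; invoked directly here to show the effect.
atexit.register(cleanup_working_dir)
cleanup_working_dir()
print(working_dir.exists())
```

As the mail notes, this only alleviates the problem: crash cases still need a separate reconciliation or garbage-collection pass.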
>>
>>Best,
>>Jinzhong
>>
>>
>>On Wed, Mar 27, 2024 at 11:49 PM Yun Tang <myas...@live.com> wrote:
>>
>>> Hi Jinzhong,
>>>
>>> The overall design looks good.
>>>
>>> I have two minor questions:
>>>
>>> 1. Why must we have another 'subTask-checkpoint-sub-dir' under the
>>> shared directory? If we don't consider making TM ownership in this
>>> FLIP, this design seems unnecessary.
>>> 2. This FLIP forgets to mention the cleanup of the remote working
>>> directory in case the taskmanager crashes. Even though this is an open
>>> problem, we can still leave some space for future optimization.
>>>
>>> Best,
>>> Yun Tang
>>>
>>> ________________________________
>>> From: Jinzhong Li <lijinzhong2...@gmail.com>
>>> Sent: Monday, March 25, 2024 10:41
>>> To: dev@flink.apache.org <dev@flink.apache.org>
>>> Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration
>>> for Disaggregated State
>>>
>>> Hi Yue,
>>>
>>> Thanks for your comments.
>>>
>>> The CURRENT is a special file that points to the latest manifest log
>>> file. As Zakelly explained above, we could record the latest manifest
>>> filename during the sync phase, and write that filename into the
>>> CURRENT snapshot file during the async phase.
>>>
>>> Best,
>>> Jinzhong
>>>
>>> On Fri, Mar 22, 2024 at 11:16 PM Zakelly Lan <zakelly....@gmail.com>
>>> wrote:
>>>
>>> > Hi Yue,
>>> >
>>> > Thanks for bringing this up!
>>> >
>>> > The CURRENT file is the special one, which should be snapshotted
>>> > during the sync phase (temporarily loaded into memory). Thus we can
>>> > solve this.
>>> >
>>> > Best,
>>> > Zakelly
>>> >
>>> > On Fri, Mar 22, 2024 at 4:55 PM yue ma <mayuefi...@gmail.com> wrote:
>>> >
>>> > > Hi Jinzhong,
>>> > > Thanks for your reply. I still have some doubts about the first
>>> > > question.
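The sync/async split described above for the CURRENT file can be sketched as a tiny simulation (the names and phase functions here are hypothetical, not Flink/ForSt APIs): capture CURRENT's target in memory during the sync phase, and during the async phase upload the captured value rather than re-reading the live file, so a concurrent manifest rollover cannot corrupt the snapshot.

```python
# Tiny simulation of snapshotting CURRENT in the sync phase.

db_current = "MANIFEST-000008"  # what CURRENT points to at snapshot time

def sync_phase() -> str:
    # Sync phase (runs under the checkpoint lock): read CURRENT's
    # content into memory.
    return db_current

def async_phase(recorded_current: str) -> str:
    # Async phase (runs later, unlocked): upload the *recorded* value,
    # never re-reading the live CURRENT file.
    return f"uploaded CURRENT -> {recorded_current}"

recorded = sync_phase()
db_current = "MANIFEST-000009"  # manifest rolls over after the sync phase
print(async_phase(recorded))    # the snapshot still references MANIFEST-000008
```

This is exactly the race yue ma asks about: because the value uploaded in the async phase was pinned during the sync phase, the later rollover to manifest 9 does not leak into the checkpoint.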
>>> > > Is there such a case: when you take a snapshot during the
>>> > > synchronization phase, you record CURRENT and manifest 8, but before
>>> > > the asynchronous phase the manifest reaches the size threshold, the
>>> > > CURRENT file points to the new manifest 9, and then the incorrect
>>> > > CURRENT file is uploaded?
>>> > >
>>> > > Jinzhong Li <lijinzhong2...@gmail.com> wrote on Wed, Mar 20, 2024
>>> > > at 20:13:
>>> > >
>>> > > > Hi Yue,
>>> > > >
>>> > > > Thanks for your feedback!
>>> > > >
>>> > > > > 1. If we choose Option-3 for ForSt, how would we handle the
>>> > > > > Manifest file? Should we take a snapshot of the Manifest during
>>> > > > > the synchronization phase?
>>> > > >
>>> > > > IIUC, the GetLiveFiles() API in Option-3 can also catch the
>>> > > > fileInfo of Manifest files, and this API also returns the manifest
>>> > > > file size, which means this API could take a snapshot of the
>>> > > > Manifest FileInfo (filename + fileSize) during the synchronization
>>> > > > phase. You could refer to the RocksDB source code [1] to verify
>>> > > > this.
>>> > > >
>>> > > > > However, many distributed storage systems do not support the
>>> > > > > ability of Fast Duplicate (such as HDFS). But ForSt has the
>>> > > > > ability to directly read and write remote files. Can we not copy
>>> > > > > or fast-duplicate these files, but instead directly reuse and
>>> > > > > reference these remote files? I think this can reduce file
>>> > > > > download time and may be more useful for most users who use HDFS
>>> > > > > (which does not support Fast Duplicate)?
>>> > > >
>>> > > > Firstly, as far as I know, most remote file systems support
>>> > > > FastDuplicate, e.g. S3 copyObject / Azure Blob Storage copyBlob /
>>> > > > OSS copyObject, while HDFS indeed does not support FastDuplicate.
>>> > > >
>>> > > > Actually, we have considered the design which reuses remote files,
>>> > > > and that is what we want to implement in the coming future, where
>>> > > > both checkpoints and restores can reuse existing files residing on
>>> > > > the remote state storage.
>>> > > > However, this design conflicts with the current file management
>>> > > > system in Flink. At present, remote state files are managed by
>>> > > > ForStDB (TaskManager side), while checkpoint files are managed by
>>> > > > the JobManager, which is a major hindrance to file reuse. For
>>> > > > example, issues could arise if a TM reuses a checkpoint file that
>>> > > > is subsequently deleted by the JM. Therefore, as mentioned in
>>> > > > FLIP-423 [2], our roadmap is to first integrate checkpoint/restore
>>> > > > mechanisms with the existing framework at milestone-1. Then, at
>>> > > > milestone-2, we plan to introduce TM State Ownership and Faster
>>> > > > Checkpointing mechanisms, which will allow both checkpointing and
>>> > > > restoring to directly reuse remote files, thus achieving faster
>>> > > > checkpointing and restoring.
>>> > > >
>>> > > > [1]
>>> > > > https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/db/db_filesnapshot.cc#L77
>>> > > > [2]
>>> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-RoadMap+LaunchingPlan
>>> > > >
>>> > > > Best,
>>> > > > Jinzhong
>>> > > >
>>> > > > On Wed, Mar 20, 2024 at 4:01 PM yue ma <mayuefi...@gmail.com> wrote:
>>> > > >
>>> > > > > Hi Jinzhong,
>>> > > > >
>>> > > > > Thank you for initiating this FLIP.
>>> > > > >
>>> > > > > I have just some minor questions:
>>> > > > >
>>> > > > > 1. If we choose Option-3 for ForSt, how would we handle the
>>> > > > > Manifest file?
>>> > > > > Should we take a snapshot of the Manifest during the
>>> > > > > synchronization phase? Otherwise, may the Manifest and MetaInfo
>>> > > > > information be inconsistent during recovery?
>>> > > > > 2. For the Restore operation, we need to fast-duplicate
>>> > > > > checkpoint files to the working dir. However, many distributed
>>> > > > > storage systems do not support the ability of Fast Duplicate
>>> > > > > (such as HDFS). But ForSt has the ability to directly read and
>>> > > > > write remote files. Can we not copy or fast-duplicate these
>>> > > > > files, but instead directly reuse and reference these remote
>>> > > > > files? I think this can reduce file download time and may be
>>> > > > > more useful for most users who use HDFS (which does not support
>>> > > > > Fast Duplicate)?
>>> > > > >
>>> > > > > --
>>> > > > > Best,
>>> > > > > Yue
>>> > > > >
>>> > > >
>>> > >
>>> > >
>>> > > --
>>> > > Best,
>>> > > Yue
>>> > >
>>> >
>>>
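Jinzhong's GetLiveFiles() point earlier in the thread, pinning each live file's name together with its size during the sync phase, can be sketched as follows. This is a simulation with hypothetical names, not the real RocksDB binding: the recorded size acts as a truncation point, so bytes the running DB appends to the MANIFEST after the sync phase never leak into the checkpoint.

```python
# Simulation of pinning (filename, size) pairs in the sync phase.

live_files = {"MANIFEST-000008": 4096, "000012.sst": 1 << 20}

def sync_phase_snapshot(files: dict[str, int]) -> list[tuple[str, int]]:
    # Record each live file together with its current size; the size is
    # the upload cut-off for append-only files like the MANIFEST.
    return sorted(files.items())

snapshot = sync_phase_snapshot(live_files)
live_files["MANIFEST-000008"] += 512  # the running DB keeps appending

def async_phase_upload(snapshot: list[tuple[str, int]]) -> list[str]:
    # Upload exactly the recorded byte range of each file.
    return [f"upload {name}[0:{size}]" for name, size in snapshot]

print(async_phase_upload(snapshot))
```

Combined with the in-memory CURRENT snapshot discussed above, this is enough to keep the Manifest and MetaInfo consistent at recovery time without blocking the DB during the async phase.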