Hi Jinzhong,

> I suggest that we could postpone this topic for now and consider it
> comprehensively combined with the TM ownership file management in the future
> FLIP.

Sorry, I still think we should cover cleanup of the working dir in this FLIP.
Although we may come up with a better solution in a subsequent FLIP, it is
important to maintain the integrity of the current changes; otherwise we may
suffer from wasted DFS space for some time. Perhaps we only need a simple
cleanup strategy at this stage, such as proactively deleting the working dir
when the TM exits. While this may fail if the TM crashes, it already
alleviates the problem.
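To illustrate what I have in mind, a rough best-effort sketch (the class and
the wiring below are hypothetical, not the actual ForSt/Flink API; only the
Flink FileSystem calls are real):

    /** Hypothetical hook invoked when a TaskManager shuts down gracefully. */
    public class WorkingDirCleanupHook implements AutoCloseable {

        private final org.apache.flink.core.fs.Path remoteWorkingDir;

        public WorkingDirCleanupHook(org.apache.flink.core.fs.Path remoteWorkingDir) {
            this.remoteWorkingDir = remoteWorkingDir;
        }

        @Override
        public void close() throws Exception {
            final org.apache.flink.core.fs.FileSystem fs = remoteWorkingDir.getFileSystem();
            // Best-effort: recursively delete this TM's remote working directory.
            // A crashed TM never reaches this point, so leaked directories still
            // need a follow-up story in the TM-ownership FLIP.
            if (fs.exists(remoteWorkingDir)) {
                fs.delete(remoteWorkingDir, true);
            }
        }
    }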
——————————————
Best regards,
Feifan Wang

At 2024-03-28 11:15:11, "Jinzhong Li" <lijinzhong2...@gmail.com> wrote:
>Hi Yun,
>
>Thanks for your reply.
>
>> 1. Why must we have another 'subTask-checkpoint-sub-dir'
>> under the shared directory? If we don't consider making
>> TM ownership in this FLIP, this design seems unnecessary.
>
>Good catch! We will not change the directory layout of the shared directory
>in this FLIP; I have already removed this part from the FLIP. I think we
>could revisit this topic in a future FLIP about TM ownership.
>
>> 2. This FLIP forgets to mention the cleanup of the remote
>> working directory in case the taskmanager crashes. Even
>> though this is an open problem, we can still leave some
>> space for future optimization.
>
>Considering that we plan to merge the TM working dir and the checkpoint dir
>into one directory, I suggest that we postpone this topic for now and
>consider it comprehensively, combined with the TM ownership file management,
>in the future FLIP.
>
>Best,
>Jinzhong
>
>On Wed, Mar 27, 2024 at 11:49 PM Yun Tang <myas...@live.com> wrote:
>
>> Hi Jinzhong,
>>
>> The overall design looks good.
>>
>> I have two minor questions:
>>
>> 1. Why must we have another 'subTask-checkpoint-sub-dir' under the shared
>> directory? If we don't consider making TM ownership in this FLIP, this
>> design seems unnecessary.
>> 2. This FLIP forgets to mention the cleanup of the remote working
>> directory in case the taskmanager crashes. Even though this is an open
>> problem, we can still leave some space for future optimization.
>>
>> Best,
>> Yun Tang
>>
>> ________________________________
>> From: Jinzhong Li <lijinzhong2...@gmail.com>
>> Sent: Monday, March 25, 2024 10:41
>> To: dev@flink.apache.org <dev@flink.apache.org>
>> Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for
>> Disaggregated State
>>
>> Hi Yue,
>>
>> Thanks for your comments.
>>
>> The CURRENT file is a special file that points to the latest manifest log
>> file. As Zakelly explained above, we could record the latest manifest
>> filename during the sync phase and write that filename into the CURRENT
>> file of the snapshot during the async phase.
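>>
>> For illustration only, the idea in rough pseudo-Java (names such as
>> readCurrentFile, checkpointFs and checkpointCurrentPath are made up, not
>> the actual ForSt code; only the Flink FileSystem calls are real):
>>
>>     // Sync phase (under the snapshot lock): remember which manifest the
>>     // CURRENT file points to right now, e.g. "MANIFEST-000008". The file
>>     // is tiny, so it is simply loaded into memory.
>>     final String manifestName = readCurrentFile(dbPath);
>>
>>     // Async phase: write the captured name as the CURRENT entry of the
>>     // checkpoint, instead of re-reading the live CURRENT file, which may
>>     // already point to a newer manifest (e.g. MANIFEST-000009) by now.
>>     try (FSDataOutputStream out =
>>             checkpointFs.create(checkpointCurrentPath, WriteMode.NO_OVERWRITE)) {
>>         out.write((manifestName + "\n").getBytes(StandardCharsets.UTF_8));
>>     }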
>>
>> Best,
>> Jinzhong
>>
>> On Fri, Mar 22, 2024 at 11:16 PM Zakelly Lan <zakelly....@gmail.com>
>> wrote:
>>
>> > Hi Yue,
>> >
>> > Thanks for bringing this up!
>> >
>> > The CURRENT file is the special one, which should be snapshot during the
>> > sync phase (temporarily loaded into memory). Thus we can solve this.
>> >
>> > Best,
>> > Zakelly
>> >
>> > On Fri, Mar 22, 2024 at 4:55 PM yue ma <mayuefi...@gmail.com> wrote:
>> >
>> > > Hi Jinzhong,
>> > > Thanks for your reply. I still have some doubts about the first
>> > > question. Is there such a case: when you make a snapshot during the
>> > > synchronization phase, you record the CURRENT file and manifest 8, but
>> > > before the asynchronous phase the manifest reaches the size threshold,
>> > > the CURRENT file is switched to the new manifest 9, and then the
>> > > incorrect CURRENT file is uploaded?
>> > >
>> > > On Wed, Mar 20, 2024 at 20:13, Jinzhong Li <lijinzhong2...@gmail.com> wrote:
>> > >
>> > > > Hi Yue,
>> > > >
>> > > > Thanks for your feedback!
>> > > >
>> > > > > 1. If we choose Option-3 for ForSt, how would we handle the
>> > > > > Manifest file? Should we take a snapshot of the Manifest during
>> > > > > the synchronization phase?
>> > > >
>> > > > IIUC, the GetLiveFiles() API in Option-3 also catches the fileInfo of
>> > > > Manifest files, and it also returns the manifest file size, which
>> > > > means this API can take a snapshot of the Manifest FileInfo
>> > > > (filename + fileSize) during the synchronization phase.
>> > > > You could refer to the rocksdb source code[1] to verify this.
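>> > > >
>> > > > For example, with the plain RocksDB Java API this looks roughly as
>> > > > follows (ForSt should expose something equivalent; treat this as an
>> > > > illustrative sketch, not the actual ForSt code):
>> > > >
>> > > >     // Sync phase: "db" is the already-open RocksDB instance.
>> > > >     // flushMemtable=false, since we only want a consistent list of
>> > > >     // the live files, not a flush. (Throws RocksDBException.)
>> > > >     final RocksDB.LiveFiles liveFiles = db.getLiveFiles(false);
>> > > >
>> > > >     // "files" holds the relative names of the SST files plus CURRENT,
>> > > >     // OPTIONS and MANIFEST-xxxxxx; "manifestFileSize" is the size up
>> > > >     // to which the manifest is valid at this point in time.
>> > > >     final long manifestSize = liveFiles.manifestFileSize;
>> > > >     for (final String relativeName : liveFiles.files) {
>> > > >         // record (relativeName, size) in the snapshot's file list
>> > > >     }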
>> > > >
>> > > > > However, many distributed storage systems do not support the
>> > > > > ability of Fast Duplicate (such as HDFS). But ForSt has the ability
>> > > > > to directly read and write remote files. Could we not copy or fast
>> > > > > duplicate these files, but instead directly reuse and reference
>> > > > > these remote files? I think this can reduce file download time and
>> > > > > may be more useful for most users who use HDFS (which does not
>> > > > > support Fast Duplicate).
>> > > >
>> > > > Firstly, as far as I know, most remote file systems do support
>> > > > FastDuplicate, e.g. S3 copyObject / Azure Blob Storage copyBlob /
>> > > > OSS copyObject; HDFS indeed does not support FastDuplicate.
>> > > >
>> > > > Actually, we have considered the design which reuses remote files,
>> > > > and that is what we want to implement in the near future, where both
>> > > > checkpoints and restores can reuse existing files residing on the
>> > > > remote state storage. However, this design conflicts with the current
>> > > > file management system in Flink. At present, remote state files are
>> > > > managed by ForStDB (TaskManager side), while checkpoint files are
>> > > > managed by the JobManager, which is a major hindrance to file reuse.
>> > > > For example, issues could arise if a TM reuses a checkpoint file that
>> > > > is subsequently deleted by the JM. Therefore, as mentioned in
>> > > > FLIP-423[2], our roadmap is to first integrate the checkpoint/restore
>> > > > mechanisms with the existing framework at milestone-1. Then, at
>> > > > milestone-2, we plan to introduce TM State Ownership and Faster
>> > > > Checkpointing mechanisms, which will allow both checkpointing and
>> > > > restoring to directly reuse remote files, thus achieving faster
>> > > > checkpointing and restoring.
>> > > >
>> > > > [1]
>> > > > https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/db/db_filesnapshot.cc#L77
>> > > > [2]
>> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-RoadMap+LaunchingPlan
>> > > >
>> > > > Best,
>> > > > Jinzhong
>> > > >
>> > > > On Wed, Mar 20, 2024 at 4:01 PM yue ma <mayuefi...@gmail.com> wrote:
>> > > >
>> > > > > Hi Jinzhong,
>> > > > >
>> > > > > Thank you for initiating this FLIP.
>> > > > >
>> > > > > I have just some minor questions:
>> > > > >
>> > > > > 1. If we choose Option-3 for ForSt, how would we handle the Manifest
>> > > > > file? Should we take a snapshot of the Manifest during the
>> > > > > synchronization phase? Otherwise, might the Manifest and MetaInfo
>> > > > > information be inconsistent during recovery?
>> > > > > 2. For the Restore operation, we need to fast-duplicate checkpoint
>> > > > > files to the working dir. However, many distributed storage systems
>> > > > > do not support the ability of Fast Duplicate (such as HDFS). But
>> > > > > ForSt has the ability to directly read and write remote files.
>> > > > > Could we not copy or fast duplicate these files, but instead
>> > > > > directly reuse and reference these remote files? I think this can
>> > > > > reduce file download time and may be more useful for most users who
>> > > > > use HDFS (which does not support Fast Duplicate)?
>> > > > >
>> > > > > --
>> > > > > Best,
>> > > > > Yue
>> > > >
>> > >
>> > > --
>> > > Best,
>> > > Yue
>> >
>>