Hi Yun. Thanks for the great suggestion. I just added the related information to the FLIP.
On Sat, Mar 30, 2024 at 10:49 AM Yun Tang <myas...@live.com> wrote:

> Hi Feifan,
>
> I just replied in the discussion of FLIP-428. I agree that we could leave
> the clean-up optimization to a future FLIP; however, I think we should
> mention this topic explicitly in the current FLIP to make the overall
> design complete and more polished.
>
> Best
> Yun Tang
> ________________________________
> From: Feifan Wang <zoltar9...@163.com>
> Sent: Thursday, March 28, 2024 12:35
> To: dev@flink.apache.org <dev@flink.apache.org>
> Subject: Re: [DISCUSS] FLIP-427: Disaggregated State Store
>
> Thanks for your reply, Hangxiang. I totally agree with you about the JNI
> part.
>
> Hi Yun Tang, I just noticed that FLIP-427 mentions "The life cycle of
> working dir is managed as before local strategy." IIUC, the working dir
> will be deleted after the TaskManager exits, and I think that's enough
> for the current stage. WDYT?
>
> ——————————————
>
> Best regards,
>
> Feifan Wang
>
>
> At 2024-03-28 12:18:56, "Hangxiang Yu" <master...@gmail.com> wrote:
> >Hi, Feifan.
> >
> >Thanks for your reply.
> >
> >> What if we only use JNI to access DFS, which needs to reuse Flink's
> >> FileSystem, and route all local disk access through the native API?
> >> This idea is based on the understanding that JNI overhead is not worth
> >> mentioning compared to DFS access latency. It might make more sense to
> >> consider avoiding JNI overhead for faster local disks. Since local disk
> >> as secondary is already under consideration [1], maybe we can discuss
> >> in that FLIP whether to use the native API to access the local disk?
> >>
> >This is a good suggestion. It's reasonable to use the native API to
> >access the local disk cache since it requires lower latency compared to
> >remote access.
> >I also believe that the JNI overhead is relatively negligible when
> >weighed against the latency of remote I/O, as mentioned in the FLIP.
> >So I think we could just go on with proposal 2 and keep proposal 1 as a
> >potential future optimization, which could work better when there is a
> >higher performance requirement, or when the native libraries of some
> >filesystems offer significantly better performance and resource usage
> >than their Java libs.
> >
> >
> >On Thu, Mar 28, 2024 at 11:39 AM Feifan Wang <zoltar9...@163.com> wrote:
> >
> >> Thanks for this valuable proposal, Hangxiang!
> >>
> >> > If we need to introduce a JNI call during each filesystem call, that
> >> > would be N times the JNI cost compared with the current RocksDB
> >> > state backend's JNI cost.
> >> What if we only use JNI to access DFS, which needs to reuse Flink's
> >> FileSystem, and route all local disk access through the native API?
> >> This idea is based on the understanding that JNI overhead is not worth
> >> mentioning compared to DFS access latency. It might make more sense to
> >> consider avoiding JNI overhead for faster local disks. Since local disk
> >> as secondary is already under consideration [1], maybe we can discuss
> >> in that FLIP whether to use the native API to access the local disk?
> >>
> >> > I'd suggest keeping `state.backend.forSt.working-dir` as it is for
> >> > now. Different disaggregated state storages may have their own
> >> > semantics about this configuration, e.g. life cycle, supported file
> >> > systems or storages.
> >> I agree with deferring the move of this configuration up to the engine
> >> level until there are other disaggregated backends.
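(For illustration, a minimal sketch of how the option under discussion could
be declared through Flink's ConfigOptions API. The ForStOptions holder class
and the description text are assumptions for this sketch, not taken from the
FLIP.)

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;

    /** Sketch only: declaring the working-dir option discussed above. */
    public class ForStOptions {
        public static final ConfigOption<String> WORKING_DIR =
                ConfigOptions.key("state.backend.forSt.working-dir")
                        .stringType()
                        .noDefaultValue()
                        .withDescription(
                                "Working directory of the ForSt state backend;"
                                        + " its life cycle follows the"
                                        + " TaskManager, as with the previous"
                                        + " local strategy.");
    }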
> >> > >> > >> [1] https://cwiki.apache.org/confluence/x/U4p3EQ > >> > >> —————————————— > >> > >> Best regards, > >> > >> Feifan Wang > >> > >> > >> > >> > >> At 2024-03-28 09:55:48, "Hangxiang Yu" <master...@gmail.com> wrote: > >> >Hi, Yun. > >> >Thanks for the reply. > >> > > >> >The JNI cost you considered is right. As replied to Yue, I agreed to > leave > >> >space and consider proposal 1 as an optimization in the future, which > is > >> >also updated in the FLIP. > >> > > >> >The other question is that the configuration of > >> >> `state.backend.forSt.working-dir` looks too coupled with the ForSt > >> >> state-backend, how would it be if we introduce another disaggregated > >> state > >> >> storage? Thus, I think `state.backend.disaggregated.working-dir` > might > >> be a > >> >> better configuration name. > >> > > >> >I'd suggest keeping `state.backend.forSt.working-dir` as it is for now. > >> >Different disaggregated state storages may have their own semantics > about > >> >this configuration, e.g. life cycle, supported file systems or > storages. > >> >Maybe it's more suitable to consider it together when we introduce > other > >> >disaggregated state storages in the future. > >> > > >> >On Thu, Mar 28, 2024 at 12:02 AM Yun Tang <myas...@live.com> wrote: > >> > > >> >> Hi Hangxiang, > >> >> > >> >> The design looks good, and I also support leaving space for proposal > 1. > >> >> > >> >> As you know, loading index/filter/data blocks for querying across > levels > >> >> would introduce high IO access within the LSM tree for old data. If > we > >> need > >> >> to introduce a JNI call during each filesystem call, that would be N > >> times > >> >> JNI cost compared with the current RocksDB state-backend's JNI cost. > >> >> > >> >> The other question is that the configuration of > >> >> `state.backend.forSt.working-dir` looks too coupled with the ForSt > >> >> state-backend, how would it be if we introduce another disaggregated > >> state > >> >> storage? Thus, I think `state.backend.disaggregated.working-dir` > might > >> be a > >> >> better configuration name. > >> >> > >> >> > >> >> Best > >> >> Yun Tang > >> >> > >> >> ________________________________ > >> >> From: Hangxiang Yu <master...@gmail.com> > >> >> Sent: Wednesday, March 20, 2024 11:32 > >> >> To: dev@flink.apache.org <dev@flink.apache.org> > >> >> Subject: Re: [DISCUSS] FLIP-427: Disaggregated State Store > >> >> > >> >> Hi, Yue. > >> >> Thanks for the reply. > >> >> > >> >> If we use proposal1, we can easily reuse these optimizations .It is > even > >> >> > possible to discuss and review the solution together in the Rocksdb > >> >> > community. > >> >> > >> >> We also saw these useful optimizations which could be applied to > ForSt > >> in > >> >> the future. > >> >> But IIUC, it's not binding to proposal 1, right? We could also > >> >> implement interfaces about temperature and secondary cache to reuse > >> them, > >> >> or organize a more complex HybridEnv based on proposal 2. > >> >> > >> >> My point is whether we should retain the potential of proposal 1 in > the > >> >> > design. > >> >> > > >> >> This is a good suggestion. We choose proposal 2 firstly due to its > >> >> maintainability and scalability, especially because it could leverage > >> all > >> >> filesystems flink supported conveniently. > >> >> Given the indelible advantage in performance, I think we could also > >> >> consider proposal 1 as an optimization in the future. 
> >> >> For the interface on the DB side, we could also expose more
> >> >> different Envs in the future.
> >> >>
> >> >>
> >> >> On Tue, Mar 19, 2024 at 9:14 PM yue ma <mayuefi...@gmail.com> wrote:
> >> >>
> >> >> > Hi Hangxiang,
> >> >> >
> >> >> > Thanks for bringing up this discussion.
> >> >> > I have a few questions about the proposals you mentioned in the
> >> >> > FLIP.
> >> >> >
> >> >> > The current conclusion is to use proposal 2, which is okay for me.
> >> >> > My point is whether we should retain the potential of proposal 1
> >> >> > in the design, for the following reasons:
> >> >> > 1. No JNI overhead, just like the Performance part mentioned in
> >> >> > the FLIP.
> >> >> > 2. RocksDB currently also provides an interface for Env, and there
> >> >> > are also some implementations, such as HDFS-Env, which seem to be
> >> >> > easily scalable.
> >> >> > 3. The RocksDB community continues to support LSM trees on
> >> >> > different storage media, such as Tiered Storage
> >> >> > <https://github.com/facebook/rocksdb/wiki/Tiered-Storage-%28Experimental%29>,
> >> >> > and some optimizations have been made for this scenario, such as
> >> >> > Per Key Placement Compaction
> >> >> > <https://rocksdb.org/blog/2022/11/09/time-aware-tiered-storage.html>
> >> >> > and Secondary Cache
> >> >> > <https://github.com/facebook/rocksdb/wiki/SecondaryCache-%28Experimental%29>,
> >> >> > similar to the Hybrid Block Cache mentioned in FLIP-423.
> >> >> > If we use proposal 1, we can easily reuse these optimizations. It
> >> >> > is even possible to discuss and review the solution together in
> >> >> > the RocksDB community.
> >> >> > In fact, we have already implemented some production practices
> >> >> > using proposal 1 internally. We have integrated HybridEnv, Tiered
> >> >> > Storage, and Secondary Cache on RocksDB and optimized the
> >> >> > performance of checkpoint and state restore. It seems to work well
> >> >> > for us.
> >> >> >
> >> >> > --
> >> >> > Best,
> >> >> > Yue
> >> >> >
> >> >>
> >> >>
> >> >> --
> >> >> Best,
> >> >> Hangxiang.
> >> >>
> >> >
> >> >
> >> >--
> >> >Best,
> >> >Hangxiang.
> >> >
> >
> >
> >--
> >Best,
> >Hangxiang.
>

--
Best,
Hangxiang.
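(A closing aside on the proposal-1 direction discussed above: swapping the
storage backend underneath RocksDB through its Env abstraction keeps per-call
file I/O entirely on the native side, with no JNI upcalls. A rough sketch
using rocksdbjni follows; Env.getDefault() stands in for a DFS-backed Env
plugin such as an HDFS Env, whose availability depends on how the native
library is built.)

    import org.rocksdb.Env;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    /** Sketch only: opening a DB whose file I/O is routed through an Env. */
    public class EnvBasedOpenSketch {
        public static RocksDB open(String dbPath) throws RocksDBException {
            RocksDB.loadLibrary();
            Options options = new Options()
                    .setCreateIfMissing(true)
                    // A DFS-backed Env would be set here under proposal 1;
                    // Env.getDefault() keeps the sketch self-contained.
                    .setEnv(Env.getDefault());
            return RocksDB.open(options, dbPath);
        }
    }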