Hi Yun. Thanks for the great suggestion. I just added the related information to the FLIP.
On Sat, Mar 30, 2024 at 10:49 AM Yun Tang <myas...@live.com> wrote:

> Hi Feifan,
>
> I just replied in the discussion of FLIP-428. I agree that we could leave
> the clean-up optimization to a future FLIP; however, I think we should
> mention this topic explicitly in the current FLIP to make the overall
> design complete and more polished.
>
> Best
> Yun Tang
> ________________________________
> From: Feifan Wang <zoltar9...@163.com>
> Sent: Thursday, March 28, 2024 12:35
> To: dev@flink.apache.org <dev@flink.apache.org>
> Subject: Re: [DISCUSS] FLIP-427: Disaggregated State Store
>
> Thanks for your reply, Hangxiang. I totally agree with you about the JNI
> part.
>
> Hi Yun Tang, I just noticed that FLIP-427 mentions "The life cycle of
> working dir is managed as before local strategy." IIUC, the working dir
> will be deleted after the TaskManager exits, and I think that's enough
> for the current stage. WDYT?
>
> ——————————————
>
> Best regards,
>
> Feifan Wang
>
>
> At 2024-03-28 12:18:56, "Hangxiang Yu" <master...@gmail.com> wrote:
> >Hi, Feifan.
> >
> >Thanks for your reply.
> >
> >> What if we only use JNI to access DFS, which needs to reuse Flink's
> >> FileSystem, and route all local disk access through the native API?
> >> This idea is based on the understanding that JNI overhead is not worth
> >> mentioning compared to DFS access latency. It might make more sense to
> >> consider avoiding JNI overhead for faster local disks. Since local disk
> >> as secondary is already under consideration [1], maybe we can discuss
> >> in that FLIP whether to use the native API to access the local disk?
> >>
> >This is a good suggestion. It's reasonable to use the native API to
> >access the local disk cache since it requires lower latency compared to
> >remote access.
> >I also believe that the JNI overhead is relatively negligible when
> >weighed against the latency of remote I/O, as mentioned in the FLIP.
> >So I think we could just go on with proposal 2 and keep proposal 1 as a
> >potential future optimization, which could work better when there is a
> >higher performance requirement, or when the native libraries of some
> >filesystems offer significantly better performance and resource usage
> >than their Java libs.
> >
> >
> >On Thu, Mar 28, 2024 at 11:39 AM Feifan Wang <zoltar9...@163.com> wrote:
> >
> >> Thanks for this valuable proposal, Hangxiang!
> >>
> >> > If we need to introduce a JNI call during each filesystem call, that
> >> > would be N times the JNI cost compared with the current RocksDB
> >> > state backend's JNI cost.
> >> What if we only use JNI to access DFS, which needs to reuse Flink's
> >> FileSystem, and route all local disk access through the native API?
> >> This idea is based on the understanding that JNI overhead is not worth
> >> mentioning compared to DFS access latency. It might make more sense to
> >> consider avoiding JNI overhead for faster local disks. Since local disk
> >> as secondary is already under consideration [1], maybe we can discuss
> >> in that FLIP whether to use the native API to access the local disk?
> >>
> >> > I'd suggest keeping `state.backend.forSt.working-dir` as it is for
> >> > now. Different disaggregated state storages may have their own
> >> > semantics about this configuration, e.g. life cycle, supported file
> >> > systems or storages.
> >> I agree with deferring the move of this configuration up to the engine
> >> level until there are other disaggregated backends.
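(For illustration, a minimal sketch of how the option under discussion could
be declared through Flink's ConfigOptions API. The ForStOptions holder class
and the description text are assumptions for this sketch, not taken from the
FLIP.)

    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;

    /** Sketch only: declaring the working-dir option discussed above. */
    public class ForStOptions {
        public static final ConfigOption<String> WORKING_DIR =
                ConfigOptions.key("state.backend.forSt.working-dir")
                        .stringType()
                        .noDefaultValue()
                        .withDescription(
                                "Working directory of the ForSt state backend;"
                                        + " its life cycle follows the"
                                        + " TaskManager, as with the previous"
                                        + " local strategy.");
    }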
> >> > >> > >> [1] https://cwiki.apache.org/confluence/x/U4p3EQ > >> > >> —————————————— > >> > >> Best regards, > >> > >> Feifan Wang > >> > >> > >> > >> > >> At 2024-03-28 09:55:48, "Hangxiang Yu" <master...@gmail.com> wrote: > >> >Hi, Yun. > >> >Thanks for the reply. > >> > > >> >The JNI cost you considered is right. As replied to Yue, I agreed to > leave > >> >space and consider proposal 1 as an optimization in the future, which > is > >> >also updated in the FLIP. > >> > > >> >The other question is that the configuration of > >> >> `state.backend.forSt.working-dir` looks too coupled with the ForSt > >> >> state-backend, how would it be if we introduce another disaggregated > >> state > >> >> storage? Thus, I think `state.backend.disaggregated.working-dir` > might > >> be a > >> >> better configuration name. > >> > > >> >I'd suggest keeping `state.backend.forSt.working-dir` as it is for now. > >> >Different disaggregated state storages may have their own semantics > about > >> >this configuration, e.g. life cycle, supported file systems or > storages. > >> >Maybe it's more suitable to consider it together when we introduce > other > >> >disaggregated state storages in the future. > >> > > >> >On Thu, Mar 28, 2024 at 12:02 AM Yun Tang <myas...@live.com> wrote: > >> > > >> >> Hi Hangxiang, > >> >> > >> >> The design looks good, and I also support leaving space for proposal > 1. > >> >> > >> >> As you know, loading index/filter/data blocks for querying across > levels > >> >> would introduce high IO access within the LSM tree for old data. If > we > >> need > >> >> to introduce a JNI call during each filesystem call, that would be N > >> times > >> >> JNI cost compared with the current RocksDB state-backend's JNI cost. > >> >> > >> >> The other question is that the configuration of > >> >> `state.backend.forSt.working-dir` looks too coupled with the ForSt > >> >> state-backend, how would it be if we introduce another disaggregated > >> state > >> >> storage? Thus, I think `state.backend.disaggregated.working-dir` > might > >> be a > >> >> better configuration name. > >> >> > >> >> > >> >> Best > >> >> Yun Tang > >> >> > >> >> ________________________________ > >> >> From: Hangxiang Yu <master...@gmail.com> > >> >> Sent: Wednesday, March 20, 2024 11:32 > >> >> To: dev@flink.apache.org <dev@flink.apache.org> > >> >> Subject: Re: [DISCUSS] FLIP-427: Disaggregated State Store > >> >> > >> >> Hi, Yue. > >> >> Thanks for the reply. > >> >> > >> >> If we use proposal1, we can easily reuse these optimizations .It is > even > >> >> > possible to discuss and review the solution together in the Rocksdb > >> >> > community. > >> >> > >> >> We also saw these useful optimizations which could be applied to > ForSt > >> in > >> >> the future. > >> >> But IIUC, it's not binding to proposal 1, right? We could also > >> >> implement interfaces about temperature and secondary cache to reuse > >> them, > >> >> or organize a more complex HybridEnv based on proposal 2. > >> >> > >> >> My point is whether we should retain the potential of proposal 1 in > the > >> >> > design. > >> >> > > >> >> This is a good suggestion. We choose proposal 2 firstly due to its > >> >> maintainability and scalability, especially because it could leverage > >> all > >> >> filesystems flink supported conveniently. > >> >> Given the indelible advantage in performance, I think we could also > >> >> consider proposal 1 as an optimization in the future. 
> >> >> For the interface on the DB side, we could also expose more
> >> >> different Envs in the future.
> >> >>
> >> >>
> >> >> On Tue, Mar 19, 2024 at 9:14 PM yue ma <mayuefi...@gmail.com> wrote:
> >> >>
> >> >> > Hi Hangxiang,
> >> >> >
> >> >> > Thanks for bringing up this discussion.
> >> >> > I have a few questions about the proposals you mentioned in the
> >> >> > FLIP.
> >> >> >
> >> >> > The current conclusion is to use proposal 2, which is okay for me.
> >> >> > My point is whether we should retain the potential of proposal 1
> >> >> > in the design, for the following reasons:
> >> >> > 1. No JNI overhead, just like the Performance part mentioned in
> >> >> > the FLIP.
> >> >> > 2. RocksDB currently also provides an interface for Env, and there
> >> >> > are also some implementations, such as HDFS-Env, which seem to be
> >> >> > easily scalable.
> >> >> > 3. The RocksDB community continues to support LSM trees on
> >> >> > different storage media, such as Tiered Storage
> >> >> > <https://github.com/facebook/rocksdb/wiki/Tiered-Storage-%28Experimental%29>,
> >> >> > and some optimizations have been made for this scenario, such as
> >> >> > Per Key Placement Compaction
> >> >> > <https://rocksdb.org/blog/2022/11/09/time-aware-tiered-storage.html>
> >> >> > and Secondary Cache
> >> >> > <https://github.com/facebook/rocksdb/wiki/SecondaryCache-%28Experimental%29>,
> >> >> > similar to the Hybrid Block Cache mentioned in FLIP-423.
> >> >> > If we use proposal 1, we can easily reuse these optimizations. It
> >> >> > is even possible to discuss and review the solution together in
> >> >> > the RocksDB community.
> >> >> > In fact, we have already implemented some production practices
> >> >> > using proposal 1 internally. We have integrated HybridEnv, Tiered
> >> >> > Storage, and Secondary Cache on RocksDB and optimized the
> >> >> > performance of checkpoint and state restore. It seems to work well
> >> >> > for us.
> >> >> >
> >> >> > --
> >> >> > Best,
> >> >> > Yue
> >> >> >
> >> >>
> >> >>
> >> >> --
> >> >> Best,
> >> >> Hangxiang.
> >> >>
> >> >
> >> >
> >> >--
> >> >Best,
> >> >Hangxiang.
> >> >
> >
> >
> >--
> >Best,
> >Hangxiang.
>

--
Best,
Hangxiang.
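(A closing aside on the proposal-1 direction discussed above: swapping the
storage backend underneath RocksDB through its Env abstraction keeps per-call
file I/O entirely on the native side, with no JNI upcalls. A rough sketch
using rocksdbjni follows; Env.getDefault() stands in for a DFS-backed Env
plugin such as an HDFS Env, whose availability depends on how the native
library is built.)

    import org.rocksdb.Env;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    /** Sketch only: opening a DB whose file I/O is routed through an Env. */
    public class EnvBasedOpenSketch {
        public static RocksDB open(String dbPath) throws RocksDBException {
            RocksDB.loadLibrary();
            Options options = new Options()
                    .setCreateIfMissing(true)
                    // A DFS-backed Env would be set here under proposal 1;
                    // Env.getDefault() keeps the sketch self-contained.
                    .setEnv(Env.getDefault());
            return RocksDB.open(options, dbPath);
        }
    }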