Thanks for your reply, Hangxiang. I totally agree with you about the JNI part.
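By the way, just to make sure we mean the same split, below is a minimal Java sketch of that idea: local paths stay on a direct-I/O branch with no per-call JNI, and only remote schemes cross the JNI boundary to reuse Flink's pluggable FileSystem. The class and method names here are made up for illustration and are not from FLIP-427:

```java
// Illustrative sketch only; HybridStateFileAccess is a made-up name,
// not part of FLIP-427 or ForSt.
import java.io.IOException;
import java.net.URI;

import org.apache.flink.core.fs.FSDataInputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

public final class HybridStateFileAccess {

    private HybridStateFileAccess() {}

    /**
     * Opens a state file for reading. In the real setup the local branch
     * would live entirely in native (C++) code and hit the disk directly,
     * so no JNI round trip happens per filesystem call; only remote
     * schemes (hdfs://, s3://, oss://, ...) cross the JNI boundary and
     * reuse whatever Flink FileSystem plugin handles the scheme.
     */
    public static FSDataInputStream open(URI uri) throws IOException {
        String scheme = uri.getScheme();
        if (scheme == null || "file".equals(scheme)) {
            // Stand-in for direct POSIX I/O on the native side.
            return FileSystem.getLocalFileSystem().open(new Path(uri));
        }
        // One JNI hop into the Java FileSystem stack, dwarfed by the
        // network latency the remote call pays anyway.
        return FileSystem.get(uri).open(new Path(uri));
    }
}
```

Since the remote branch already pays network latency on every call, the extra JNI hop there should be noise, which is in line with the reasoning in the FLIP.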
Hi Yun Tang, I just noticed that FLIP-427 mentions “The life cycle of working dir is managed as before local strategy.” IIUC, the working dir will be deleted after the TaskManager exits, and I think that's enough for the current stage. WDYT?

——————————————
Best regards,
Feifan Wang

At 2024-03-28 12:18:56, "Hangxiang Yu" <master...@gmail.com> wrote:
>Hi, Feifan.
>
>Thanks for your reply.
>
>> What if we only use JNI to access DFS that needs to reuse Flink
>> FileSystem? And all local disk access goes through the native API. This
>> idea is based on the understanding that JNI overhead is negligible
>> compared to DFS access latency. It might make more sense to consider
>> avoiding JNI overhead for faster local disks. Since local disk as
>> secondary is already under consideration [1], maybe we can discuss in
>> that FLIP whether to use the native API to access local disk?
>>
>This is a good suggestion. It's reasonable to use the native API to access
>the local disk cache since it requires lower latency compared to remote
>access. I also believe that the JNI overhead is relatively negligible when
>weighed against the latency of remote I/O, as mentioned in the FLIP.
>So I think we could just go on with proposal 2 and keep proposal 1 as a
>potential future optimization, which could work better when there is a
>higher performance requirement or some native libraries of filesystems
>have significantly better performance and resource usage than their Java
>libs.
>
>
>On Thu, Mar 28, 2024 at 11:39 AM Feifan Wang <zoltar9...@163.com> wrote:
>
>> Thanks for this valuable proposal, Hangxiang!
>>
>>
>> > If we need to introduce a JNI call during each filesystem call, that
>> > would be N times the JNI cost compared with the current RocksDB
>> > state-backend's JNI cost.
>> What if we only use JNI to access DFS that needs to reuse Flink
>> FileSystem? And all local disk access goes through the native API. This
>> idea is based on the understanding that JNI overhead is negligible
>> compared to DFS access latency. It might make more sense to consider
>> avoiding JNI overhead for faster local disks. Since local disk as
>> secondary is already under consideration [1], maybe we can discuss in
>> that FLIP whether to use the native API to access local disk?
>>
>>
>> >I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
>> >Different disaggregated state storages may have their own semantics
>> >about this configuration, e.g. life cycle, supported file systems or
>> >storages.
>> I agree with considering moving this configuration up to the engine
>> level once there are other disaggregated backends.
>>
>>
>> [1] https://cwiki.apache.org/confluence/x/U4p3EQ
>>
>> ——————————————
>>
>> Best regards,
>>
>> Feifan Wang
>>
>>
>>
>>
>> At 2024-03-28 09:55:48, "Hangxiang Yu" <master...@gmail.com> wrote:
>> >Hi, Yun.
>> >Thanks for the reply.
>> >
>> >The JNI cost you considered is right. As replied to Yue, I agreed to
>> >leave space and consider proposal 1 as an optimization in the future,
>> >which is also updated in the FLIP.
>> >
>> >> The other question is that the configuration of
>> >> `state.backend.forSt.working-dir` looks too coupled with the ForSt
>> >> state-backend; how would it be if we introduce another disaggregated
>> >> state storage? Thus, I think `state.backend.disaggregated.working-dir`
>> >> might be a better configuration name.
>> >
>> >I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
>> >Different disaggregated state storages may have their own semantics
>> >about this configuration, e.g. life cycle, supported file systems or
>> >storages.
>> >Maybe it's more suitable to consider it together when we introduce
>> >other disaggregated state storages in the future.
>> >
>> >On Thu, Mar 28, 2024 at 12:02 AM Yun Tang <myas...@live.com> wrote:
>> >
>> >> Hi Hangxiang,
>> >>
>> >> The design looks good, and I also support leaving space for
>> >> proposal 1.
>> >>
>> >> As you know, loading index/filter/data blocks for querying across
>> >> levels would introduce high IO access within the LSM tree for old
>> >> data. If we need to introduce a JNI call during each filesystem call,
>> >> that would be N times the JNI cost compared with the current RocksDB
>> >> state-backend's JNI cost.
>> >>
>> >> The other question is that the configuration of
>> >> `state.backend.forSt.working-dir` looks too coupled with the ForSt
>> >> state-backend; how would it be if we introduce another disaggregated
>> >> state storage? Thus, I think `state.backend.disaggregated.working-dir`
>> >> might be a better configuration name.
>> >>
>> >>
>> >> Best
>> >> Yun Tang
>> >>
>> >> ________________________________
>> >> From: Hangxiang Yu <master...@gmail.com>
>> >> Sent: Wednesday, March 20, 2024 11:32
>> >> To: dev@flink.apache.org <dev@flink.apache.org>
>> >> Subject: Re: [DISCUSS] FLIP-427: Disaggregated State Store
>> >>
>> >> Hi, Yue.
>> >> Thanks for the reply.
>> >>
>> >> > If we use proposal 1, we can easily reuse these optimizations. It
>> >> > is even possible to discuss and review the solution together in the
>> >> > RocksDB community.
>> >>
>> >> We also saw these useful optimizations, which could be applied to
>> >> ForSt in the future.
>> >> But IIUC, they are not bound to proposal 1, right? We could also
>> >> implement interfaces about temperature and secondary cache to reuse
>> >> them, or organize a more complex HybridEnv based on proposal 2.
>> >>
>> >> > My point is whether we should retain the potential of proposal 1 in
>> >> > the design.
>> >>
>> >> This is a good suggestion. We chose proposal 2 first due to its
>> >> maintainability and scalability, especially because it can
>> >> conveniently leverage all filesystems Flink supports.
>> >> Given the undeniable performance advantage, I think we could also
>> >> consider proposal 1 as an optimization in the future.
>> >> For the interface on the DB side, we could also expose more different
>> >> Envs in the future.
>> >>
>> >>
>> >> On Tue, Mar 19, 2024 at 9:14 PM yue ma <mayuefi...@gmail.com> wrote:
>> >>
>> >> > Hi Hangxiang,
>> >> >
>> >> > Thanks for bringing up this discussion.
>> >> > I have a few questions about the proposals you mentioned in the
>> >> > FLIP.
>> >> >
>> >> > The current conclusion is to use proposal 2, which is okay for me.
>> >> > My point is whether we should retain the potential of proposal 1 in
>> >> > the design. There are the following reasons:
>> >> > 1. No JNI overhead, as described in the Performance part of the
>> >> > FLIP.
>> >> > 2. RocksDB currently also provides an interface for Env, and there
>> >> > are also some implementations, such as HDFS-Env, which seem to be
>> >> > easily extensible.
>> >> > 3. The RocksDB community continues to support LSM trees on
>> >> > different storage media, such as Tiered Storage
>> >> > <https://github.com/facebook/rocksdb/wiki/Tiered-Storage-%28Experimental%29>,
>> >> > and some optimizations have been made for this scenario, such as
>> >> > Per Key Placement Compaction
>> >> > <https://rocksdb.org/blog/2022/11/09/time-aware-tiered-storage.html>
>> >> > and Secondary Cache
>> >> > <https://github.com/facebook/rocksdb/wiki/SecondaryCache-%28Experimental%29>,
>> >> > which is similar to the Hybrid Block Cache mentioned in FLIP-423.
>> >> > If we use proposal 1, we can easily reuse these optimizations. It
>> >> > is even possible to discuss and review the solution together in the
>> >> > RocksDB community.
>> >> > In fact, we have already implemented some production practices
>> >> > using proposal 1 internally. We have integrated HybridEnv, Tiered
>> >> > Storage, and Secondary Cache on RocksDB and optimized the
>> >> > performance of checkpoints and state restore. It seems to work well
>> >> > for us.
>> >> >
>> >> > --
>> >> > Best,
>> >> > Yue
>> >> >
>> >>
>> >>
>> >> --
>> >> Best,
>> >> Hangxiang.
>> >>
>> >
>> >
>> >--
>> >Best,
>> >Hangxiang.
>> >
>
>
>--
>Best,
>Hangxiang.