Hi, Jeyhun. Thanks for the reply.
> Is this argument true for all workloads? Or does this argument also
> hold for workloads with many small files, which is quite a common case
> [1]?

Yes, I think so. The overhead should still be negligible, particularly in
comparison to remote I/O, and the other benefits of this proposal may well
outweigh it. Additionally, there is already JNI overhead today when Flink
calls RocksDB methods, and the frequency of those calls can exceed that of
actual file system interface calls, given that not all state requests need
to access the file system. BTW, the small-files issue can also hurt the
performance of the DB on a local file system at runtime, so we usually
resolve it first in the production environment.

> the engine spawns a huge number of scan range requests to the file
> system to retrieve different parts of a file.

Indeed, frequent requests to the remote file system can significantly
affect performance. To address this, the other FLIPs in this series
introduce several strategies:

1. A local disk cache to minimize remote requests, as described in
FLIP-423 and to be introduced in FLIP-429, as you mentioned. With
effective cache utilization, performance on cache hits will not be
inferior to the local strategy.

2. Grouping remote access to decrease the number of remote I/O requests,
as proposed in "FLIP-426: Grouping Remote State Access".

3. Parallel I/O to maximize network bandwidth usage, as outlined in
"FLIP-425: Asynchronous Execution Model".

The PoC implements a simple file cache and asynchronous execution, which
improve the performance a lot. You could also refer to the PoC results in
FLIP-423. To make these ideas more concrete, I've put three simplified
sketches below, one per strategy.
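For idea 1, here is a rough sketch of the read-through caching direction.
Note this is only an illustration, not the actual ForSt implementation:
the RemoteStorage interface and all names below are made up. The point is
simply that the first read of a file pays one remote round trip and
materializes the bytes on local disk, after which reads are served at
local-disk latency:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical handle to remote storage; stands in for the real DFS client.
interface RemoteStorage {
    InputStream open(String remotePath) throws IOException;
}

// Minimal read-through cache. For brevity it ignores eviction, partial
// reads, and concurrent misses, which a real cache must handle.
final class ReadThroughFileCache {
    private final RemoteStorage remote;
    private final Path cacheDir;

    ReadThroughFileCache(RemoteStorage remote, Path cacheDir) {
        this.remote = remote;
        this.cacheDir = cacheDir;
    }

    InputStream open(String remotePath) throws IOException {
        // Derive a stable local name from the remote path (simplified).
        Path local = cacheDir.resolve(Integer.toHexString(remotePath.hashCode()));
        if (!Files.exists(local)) {
            // Cache miss: one remote round trip, then the file is local.
            try (InputStream in = remote.open(remotePath)) {
                Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        // Cache hit (or freshly filled): serve from local disk.
        return Files.newInputStream(local);
    }
}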
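For idea 2, I won't restate the FLIP-426 design here, but one generic way
to see why grouping helps with exactly the "many scan range requests"
pattern you described is range coalescing: sort the requested ranges and
merge neighbors whose gap is small, trading a few wasted bytes for far
fewer round trips. Again a simplified illustration, not the FLIP-426
mechanism itself:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// A requested byte range within one remote file.
record Range(long offset, long length) {}

final class RangeCoalescer {
    // Merge any two ranges whose gap is at most maxGap, so n small scan
    // range requests collapse into a handful of larger remote reads.
    static List<Range> coalesce(List<Range> ranges, long maxGap) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong(Range::offset));
        List<Range> merged = new ArrayList<>();
        for (Range r : sorted) {
            if (!merged.isEmpty()) {
                Range last = merged.get(merged.size() - 1);
                long lastEnd = last.offset() + last.length();
                if (r.offset() - lastEnd <= maxGap) {
                    // Extend the previous range to also cover this one.
                    long newEnd = Math.max(lastEnd, r.offset() + r.length());
                    merged.set(merged.size() - 1,
                            new Range(last.offset(), newEnd - last.offset()));
                    continue;
                }
            }
            merged.add(r);
        }
        return merged;
    }
}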
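And for idea 3, a minimal sketch of why asynchronous execution helps: if
independent state reads are issued concurrently, their remote latencies
overlap instead of adding up, so total wall time approaches one round
trip rather than one per key. The helper below is purely illustrative;
the real execution model (ordering guarantees, checkpointing, etc.) is
the subject of FLIP-425:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;
import java.util.stream.Collectors;

final class AsyncStateReads {
    // Issue every read on the I/O pool, then collect results when all of
    // them complete; n reads cost roughly one round trip of wall time.
    static <K, V> CompletableFuture<List<V>> readAll(
            List<K> keys, Function<K, V> blockingRead, Executor ioPool) {
        List<CompletableFuture<V>> futures = keys.stream()
                .map(k -> CompletableFuture.supplyAsync(
                        () -> blockingRead.apply(k), ioPool))
                .collect(Collectors.toList());
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> futures.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.toList()));
    }
}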
On Mon, Mar 11, 2024 at 3:11 AM Jeyhun Karimov <je.kari...@gmail.com> wrote:

> Hi Hangxiang,
>
> Thanks for the proposal. +1 for it.
> I have a few comments.
>
> > Proposal 2 has additional JNI overhead, but the overhead is relatively
> > negligible when weighed against the latency of remote I/O.
>
> - Is this argument true for all workloads? Or does this argument also
> hold for workloads with many small files, which is quite a common case
> [1]?
>
> - Also, in many workloads the engine does not need the whole file,
> either because the query forces it, or the file type supports efficient
> filtering (e.g. ORC, Parquet, Arrow files), or simply one file is
> "divided" among multiple workers.
> In these cases, the engine spawns a huge number of scan range requests
> to the file system to retrieve different parts of a file.
> How would the proposed solution work with these workloads?
>
> - A similar question related to the above also applies to caching (I
> know caching is the subject of FLIP-429; asking here because of the
> related section in this FLIP).
>
> Regards,
> Jeyhun
>
> [1] https://blog.min.io/challenge-big-data-small-files/
>
>
> On Thu, Mar 7, 2024 at 10:09 AM Hangxiang Yu <master...@gmail.com> wrote:
>
> > Hi devs,
> >
> > I'd like to start a discussion on a sub-FLIP of FLIP-423: Disaggregated
> > State Storage and Management[1], which is a joint work of Yuan Mei,
> > Zakelly Lan, Jinzhong Li, Hangxiang Yu, Yanfei Lei and Feng Wang:
> >
> > - FLIP-427: Disaggregated State Store
> >
> > This FLIP introduces the initial version of the ForSt disaggregated
> > state store.
> >
> > Please make sure you have read FLIP-423[1] to know the whole story, and
> > we'll discuss the details of FLIP-427[2] under this mail. For the
> > discussion of the overall architecture or topics related to multiple
> > sub-FLIPs, please post in the previous mail[3].
> >
> > Looking forward to hearing from you!
> >
> > [1] https://cwiki.apache.org/confluence/x/R4p3EQ
> >
> > [2] https://cwiki.apache.org/confluence/x/T4p3EQ
> >
> > [3] https://lists.apache.org/thread/ct8smn6g9y0b8730z7rp9zfpnwmj8vf0
> >
> > Best,
> >
> > Hangxiang.
> >

--
Best,
Hangxiang.