Thanks Andrew. As you mentioned, the ChunkReader trait is flexible enough; what is missing is a way to plug a parquet reader implementation built on a customized ChunkReader into the query execution. Are there any examples within DataFusion where people change the execution plan like this?
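To check my understanding of the extension point, here is a rough, untested sketch of what I have in mind: download the object bytes (e.g. with rusoto) and hand them to SerializedFileReader through a custom ChunkReader. The trait signatures are my reading of the 3.0 docs you linked below, and RemoteChunkReader / read_remote are made-up names, so please correct me if the shape is wrong:

```rust
use std::io::Cursor;
use std::sync::Arc;

use parquet::errors::{ParquetError, Result};
use parquet::file::reader::{ChunkReader, FileReader, Length};
use parquet::file::serialized_reader::SerializedFileReader;

/// Hypothetical reader over bytes already fetched from a remote store
/// (e.g. downloaded from S3 with rusoto) and held in memory.
struct RemoteChunkReader {
    data: Arc<Vec<u8>>,
}

impl Length for RemoteChunkReader {
    fn len(&self) -> u64 {
        self.data.len() as u64
    }
}

impl ChunkReader for RemoteChunkReader {
    type T = Cursor<Vec<u8>>;

    // Hand back an independent reader over `length` bytes starting at `start`.
    fn get_read(&self, start: u64, length: usize) -> Result<Self::T> {
        let start = start as usize;
        let end = start + length;
        if end > self.data.len() {
            return Err(ParquetError::EOF("read past end of data".to_string()));
        }
        Ok(Cursor::new(self.data[start..end].to_vec()))
    }
}

/// Build a file reader from bytes that came from S3 instead of a local file.
fn read_remote(bytes: Vec<u8>) -> Result<()> {
    let reader = SerializedFileReader::new(RemoteChunkReader {
        data: Arc::new(bytes),
    })?;
    println!("row groups: {}", reader.metadata().num_row_groups());
    Ok(())
}
```

If that is roughly right, the open question is still how to get such a reader used by the parquet execution plan without re-implementing it.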
If I understand correctly, the steps cloudfuse-io took were:

1. define an S3 parquet table provider [1]
2. define an S3 parquet reader [2]

This confirms my understanding that creating your own remote parquet reader currently requires a lot of duplication. (As for approach (2) from my original mail, I put a rough sketch of the kind of interface I had in mind at the bottom of this message, below the quoted thread.)

[1] https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/datasource/hbee/s3_parquet.rs
[2] https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs

Jack

Andrew Lamb <al...@influxdata.com> wrote on Sun, Feb 14, 2021 at 2:14 AM:

> The Buzz project is one example I know of that reads parquet files from S3
> using the Rust implementation:
>
> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs
>
> The SerializedFileReader [1] from the Rust parquet crate, despite its
> somewhat misleading name, doesn't have to read from files; instead it reads
> from something that implements the ChunkReader [2] trait. I am not sure how
> well this matches what you are looking for.
>
> Hope that helps,
> Andrew
>
> [1] https://docs.rs/parquet/3.0.0/parquet/file/serialized_reader/struct.SerializedFileReader.html
> [2] https://docs.rs/parquet/3.0.0/parquet/file/reader/trait.ChunkReader.html
>
> On Sat, Feb 13, 2021 at 10:17 AM Steve Kim <chairm...@gmail.com> wrote:
>
>> > Currently, parquet.rs only supports local disk files. Potentially,
>> > this can be done using the rusoto crate that provides an S3 client. What
>> > would be a good way to do this?
>> > 1. create a remote parquet reader (potentially duplicating lots of code)
>> > 2. create an interface to abstract away reading from local/remote files
>> > (not sure about performance if the reader blocks on every operation)
>>
>> This is a great question.
>>
>> I think that approach (2) is superior, although it requires more work
>> than approach (1) to design an interface that works well across
>> multiple file stores that have different performance characteristics.
>> To accommodate storage-specific performance optimizations, I expect
>> that the common interface will have to be more elaborate than the
>> current reader API.
>>
>> Is it possible for the Rust reader to use the C++ filesystem implementation
>> (https://github.com/apache/arrow/tree/master/cpp/src/arrow/filesystem)?
>> If reusing that implementation is feasible, then we could focus our
>> efforts on improving the C++ implementation and get the benefits in
>> Python, Rust, etc.
>>
>> In the Java ecosystem, the (non-Arrow, row-wise) Parquet reader uses
>> the Hadoop FileSystem abstraction. That abstraction is complex, leaky,
>> and not well specialized for the read patterns that are typical of
>> Parquet files. We can learn from these mistakes to create a superior
>> reader interface in the Arrow/Parquet project.
>>
>> Steve
>>
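PS: to make approach (2) from my original question a bit more concrete, below is the rough kind of interface I was imagining. This is only a sketch; all of the names (ObjectSource, LocalFileSource, read_range) are hypothetical and nothing like this exists in the crates today:

```rust
use std::fs::File;
use std::io::{Read, Result, Seek, SeekFrom};
use std::path::PathBuf;

/// Hypothetical abstraction that both local files and remote object stores
/// (e.g. S3 via rusoto) could implement, so the parquet reader does not have
/// to care where the bytes live.
trait ObjectSource: Send + Sync {
    /// Total size of the object in bytes (needed to locate the parquet footer).
    fn size(&self) -> Result<u64>;

    /// Read `length` bytes starting at `start`. For S3 this would map to a
    /// ranged GET; for a local file, to a seek followed by a read.
    fn read_range(&self, start: u64, length: usize) -> Result<Vec<u8>>;
}

/// Local-disk implementation, to show the interface is not S3-specific.
struct LocalFileSource {
    path: PathBuf,
}

impl ObjectSource for LocalFileSource {
    fn size(&self) -> Result<u64> {
        Ok(std::fs::metadata(&self.path)?.len())
    }

    fn read_range(&self, start: u64, length: usize) -> Result<Vec<u8>> {
        let mut file = File::open(&self.path)?;
        file.seek(SeekFrom::Start(start))?;
        let mut buf = vec![0u8; length];
        file.read_exact(&mut buf)?;
        Ok(buf)
    }
}
```

The performance concern I mentioned shows up in read_range: a naive S3 implementation would block on a network round trip for every column chunk, so a real design would probably need to coalesce adjacent ranges or prefetch, which is exactly where Steve's point about the interface needing to be more elaborate than the current reader API comes in.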