I don't know of any examples in the DataFusion codebase that take a ChunkReader directly.

The cloudfuse-io code implements the ChunkReader trait for its `CachedFile` type here:
https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/clients/cached_file.rs#L10
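For reference, here is a minimal sketch of what such an implementation can look like, written against the ChunkReader and Length trait shapes documented for parquet 3.0 (see link [2] in the quoted message below). InMemoryChunkReader is a made-up name backed by a buffer; a real remote reader would hold an S3 client and issue a ranged GET in get_read instead of slicing memory.

use std::io::Cursor;
use std::sync::Arc;

use parquet::errors::Result;
use parquet::file::reader::{ChunkReader, Length};
use parquet::file::serialized_reader::SerializedFileReader;

// Hypothetical source backed by an in-memory buffer. A remote reader
// would keep an S3 client and an object key here instead of the bytes.
struct InMemoryChunkReader {
    data: Arc<Vec<u8>>,
}

impl Length for InMemoryChunkReader {
    fn len(&self) -> u64 {
        self.data.len() as u64
    }
}

impl ChunkReader for InMemoryChunkReader {
    type T = Cursor<Vec<u8>>;

    // Return a `Read` over the byte range [start, start + length).
    // An S3-backed implementation would issue a ranged GET here.
    fn get_read(&self, start: u64, length: usize) -> Result<Self::T> {
        let start = start as usize;
        Ok(Cursor::new(self.data[start..start + length].to_vec()))
    }
}

// SerializedFileReader then works unchanged on top of the custom source.
fn open(data: Arc<Vec<u8>>) -> Result<SerializedFileReader<InMemoryChunkReader>> {
    SerializedFileReader::new(InMemoryChunkReader { data })
}

The main thing a remote implementation has to decide is how to bridge the synchronous get_read call to a (typically async) network client, which is the blocking concern raised in Steve's quoted message below.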
On Mon, Feb 15, 2021 at 4:19 AM Jack Chan <j4ck....@gmail.com> wrote:

> Thanks Andrew.
>
> As you mentioned, the ChunkReader is flexible enough. So what is missing
> is a way to provide a parquet reader implementation over a customized
> ChunkReader. Are there any examples within DataFusion where people can
> change the execution plan like this?
>
> If I understand correctly, the steps cloudfuse-io took are: 1. define an
> S3 parquet table provider [1]; 2. define an S3 parquet reader [2]. This
> does confirm my understanding that creating your own remote parquet
> reader requires a lot of duplication.
>
> [1]
> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/datasource/hbee/s3_parquet.rs
> [2]
> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs
>
> Jack
>
> Andrew Lamb <al...@influxdata.com> wrote on Sun, Feb 14, 2021 at 2:14 AM:
>
>> The Buzz project is one example I know of that reads parquet files from
>> S3 using the Rust implementation:
>>
>> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs
>>
>> The SerializedFileReader [1] from the Rust parquet crate, despite its
>> somewhat misleading name, doesn't have to read from files; instead, it
>> reads from anything that implements the ChunkReader [2] trait. I am not
>> sure how well this matches what you are looking for.
>>
>> Hope that helps,
>> Andrew
>>
>> [1]
>> https://docs.rs/parquet/3.0.0/parquet/file/serialized_reader/struct.SerializedFileReader.html
>> [2]
>> https://docs.rs/parquet/3.0.0/parquet/file/reader/trait.ChunkReader.html
>>
>> On Sat, Feb 13, 2021 at 10:17 AM Steve Kim <chairm...@gmail.com> wrote:
>>
>>> > Currently, parquet.rs only supports local disk files. Potentially,
>>> > this can be done using the rusoto crate, which provides an S3 client.
>>> > What would be a good way to do this?
>>> > 1. create a remote parquet reader (potentially duplicating lots of code)
>>> > 2. create an interface to abstract away reading from local/remote files
>>> > (not sure about performance if the reader blocks on every operation)
>>>
>>> This is a great question.
>>>
>>> I think that approach (2) is superior, although it requires more work
>>> than approach (1) to design an interface that works well across
>>> multiple file stores with different performance characteristics. To
>>> accommodate storage-specific performance optimizations, I expect that
>>> the common interface will have to be more elaborate than the current
>>> reader API.
>>>
>>> Is it possible for the Rust reader to use the C++ filesystem
>>> implementation
>>> (https://github.com/apache/arrow/tree/master/cpp/src/arrow/filesystem)?
>>> If this reuse is feasible, then we could focus efforts on improving the
>>> C++ implementation and get the benefits in Python, Rust, etc.
>>>
>>> In the Java ecosystem, the (non-Arrow, row-wise) Parquet reader uses
>>> the Hadoop FileSystem abstraction. This abstraction is complex, leaky,
>>> and not well specialized for the read patterns that are typical of
>>> Parquet files. We can learn from these mistakes to create a superior
>>> reader interface in the Arrow/Parquet project.
>>>
>>> Steve
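To make Steve's approach (2) a bit more concrete: a hypothetical sketch of such an interface might center on ranged reads, which fit Parquet's access pattern (footer first, then selected column chunks). The ObjectReader trait and LocalFile type below are illustrative names only, not an existing API in the parquet crate or DataFusion.

use std::fs::File;
use std::io::{Read, Result, Seek, SeekFrom};
use std::path::PathBuf;

// Illustrative storage abstraction: the parquet reader would be written
// once against this trait, with local files and object stores as
// interchangeable implementations.
trait ObjectReader: Send + Sync {
    // Total object size; Parquet needs this to locate the footer.
    fn length(&self) -> Result<u64>;

    // Reader over the byte range [start, start + length). A local file
    // seeks; an S3 implementation would issue a ranged GET.
    fn read_range(&self, start: u64, length: usize) -> Result<Box<dyn Read + Send>>;
}

struct LocalFile {
    path: PathBuf,
}

impl ObjectReader for LocalFile {
    fn length(&self) -> Result<u64> {
        Ok(std::fs::metadata(&self.path)?.len())
    }

    fn read_range(&self, start: u64, length: usize) -> Result<Box<dyn Read + Send>> {
        let mut file = File::open(&self.path)?;
        file.seek(SeekFrom::Start(start))?;
        Ok(Box::new(file.take(length as u64)))
    }
}

A remote implementation of read_range could issue a ranged GET via rusoto, and keeping the ranges explicit in the interface leaves room for the storage-specific optimizations Steve mentions, such as coalescing adjacent ranges or prefetching.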