Thanks Andrew.

As you mentioned, the ChunkReader trait is flexible enough. So what is
missing is a way to provide a parquet reader implementation backed by a
customized ChunkReader. Are there any examples within DataFusion of
people changing the execution plan like this?
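To make the question concrete, here is roughly what I picture a custom
ChunkReader looking like (only a sketch against the 3.0.0 docs Andrew
linked; S3File and ranged_get are made-up names, and the actual rusoto
call is elided):

use parquet::errors::{ParquetError, Result};
use parquet::file::reader::{ChunkReader, Length};

/// Hypothetical handle to one object in S3; bucket/key/size would be
/// filled in by whatever lists the objects.
struct S3File {
    bucket: String,
    key: String,
    size: u64,
}

impl S3File {
    // A real implementation would issue a GetObject with a
    // `Range: bytes=start-(start+length-1)` header via rusoto; the stub
    // only keeps the sketch self-contained.
    fn ranged_get(&self, start: u64, length: usize) -> Result<Vec<u8>> {
        let _ = (start, length, &self.bucket, &self.key);
        Err(ParquetError::General("ranged GET not implemented".into()))
    }
}

impl Length for S3File {
    fn len(&self) -> u64 {
        self.size
    }
}

impl ChunkReader for S3File {
    // Cursor<Vec<u8>> implements Read, which is all the trait needs.
    type T = std::io::Cursor<Vec<u8>>;

    fn get_read(&self, start: u64, length: usize) -> Result<Self::T> {
        // Every footer / column chunk read becomes one ranged GET to S3.
        Ok(std::io::Cursor::new(self.ranged_get(start, length)?))
    }
}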

If I understand correctly, the steps cloudfuse-io took were: 1. define
an S3 parquet table provider [1], and 2. define an S3 parquet reader as
its own execution plan [2]. This does confirm my understanding that
creating your own remote parquet reader requires a lot of duplication.
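For what it is worth, consuming such a ChunkReader through the existing
APIs looks straightforward; here is an equally untested sketch against
the 3.0.0 arrow reader:

use std::sync::Arc;

use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::serialized_reader::SerializedFileReader;

// `S3File` is the hypothetical ChunkReader sketched above.
fn read_batches(s3_file: S3File) -> Result<(), Box<dyn std::error::Error>> {
    // SerializedFileReader is generic over any ChunkReader, not just File.
    let file_reader = SerializedFileReader::new(s3_file)?;
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));
    // 1024 rows per batch, picked arbitrarily for the example.
    for batch in arrow_reader.get_record_reader(1024)? {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}

So the reader side itself may not need much new code; the duplication
seems to be in the table provider and execution plan, which is exactly
what buzz-rust reimplements.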

[1]
https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/datasource/hbee/s3_parquet.rs
[2]
https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs

Jack

Andrew Lamb <al...@influxdata.com> wrote on Sun, Feb 14, 2021 at 2:14 AM:

> The Buzz project is one example I know of that reads parquet files from S3
> using the Rust implementation
>
>
> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs
>
> The SerializedFileReader [1] from the Rust parquet crate, despite its
> somewhat misleading name, doesn't have to read from files; instead, it
> reads from something that implements the ChunkReader [2] trait. I am not
> sure how well this matches what you are looking for.
>
> Hope that helps,
> Andrew
>
> [1]
> https://docs.rs/parquet/3.0.0/parquet/file/serialized_reader/struct.SerializedFileReader.html
> [2]
> https://docs.rs/parquet/3.0.0/parquet/file/reader/trait.ChunkReader.html
>
>
>
> On Sat, Feb 13, 2021 at 10:17 AM Steve Kim <chairm...@gmail.com> wrote:
>
>> > Currently, parquet.rs only supports local disk files. Potentially,
>> > this can be done using the rusoto crate that provides an S3 client.
>> > What would be a good way to do this?
>> > 1. create a remote parquet reader (potentially duplicate lots of code)
>> > 2. create an interface to abstract away reading from local/remote
>> > files (not sure about performance if the reader blocks on every
>> > operation)
>>
>> This is a great question.
>>
>> I think that approach (2) is superior, although it requires more work
>> than approach (1) to design an interface that works well across
>> multiple file stores that have different performance characteristics.
>> To accommodate storage-specific performance optimizations, I expect
>> that the common interface will have to be more elaborate than the
>> current reader API.
>>
>> Is it possible for the Rust reader to use the C++ implementation
>> (https://github.com/apache/arrow/tree/master/cpp/src/arrow/filesystem)?
>> If this reuse of implementation is feasible, then we could focus
>> efforts on improving the C++ implementation and get the benefits in
>> Python, Rust, etc.
>>
>> In the Java ecosystem, the (non-Arrow, row-wise) Parquet reader uses
>> the Hadoop FileSystem abstraction. This abstraction is complex, leaky,
>> and not well specialized for read patterns that are typical for
>> Parquet files. We can learn from these mistakes to create a superior
>> reader interface in the Arrow/Parquet project.
>>
>> Steve
>>
>
