This sounds like really awesome work!

If it is in its own repo, would that mean the current implementation in Arrow 
would just be left there?  
Good parquet support seems really important to have.

Evan

> On Apr 17, 2021, at 3:14 AM, Andrew Lamb <al...@influxdata.com> wrote:
> 
> It sounds like exciting work Jorge -- Thank you for the update!
> 
> I wonder what you hope to gain by bringing it to an ASF repo that you can't
> get in your own repo?
> 
> Perhaps you are ready to bring in other collaborators and wish to ensure
> they have undergone the Apache IP clearance process?
> 
> Andrew
> 
> 
> On Fri, Apr 16, 2021 at 12:22 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
> 
>> Hi,
>> 
>> As briefly discussed in a recent email thread, I have been experimenting
>> with re-writing the Rust parquet implementation. I have not advertised
>> this much, as I was very sceptical that it would work. I am now
>> confident that it can, and would thus like to share more details.
>> 
>> parquet2 [1] is a rewrite of the parquet crate, with security,
>> performance, and parallelism as core requirements.
>> 
>> Here are the highlights so far:
>> 
>> - Security: *no use of unsafe*. All invariants about memory and thread
>> safety are proven by the Rust compiler (an audit of its 3 mandatory + 5
>> optional compressors is still required; compare e.g. ARROW-10920). A
>> sketch of how this can be enforced at compile time follows this list.
>> 
>> - Performance: to the best of my benchmarking capabilities, *3-15x faster*
>> than the parquet crate, both reading from and writing to arrow, and about
>> the same performance as pyarrow/c++. These numbers correspond to a single
>> plain page with 10% nulls, and the gap grows with the number of slots /
>> page size (which imo is a relevant unit of work). See [2] for plots,
>> numbers and references to the exact commits.
>> 
>> - Features: it reads optional primitive types, V1 and V2 pages,
>> dictionary- and non-dictionary-encoded pages, repetition and definition
>> levels, and metadata. It reads 1-level nullable lists. It writes
>> non-dictionary V1 pages with PLAIN and RLE encodings. No delta encoding
>> yet; no statistics yet.
>> 
>> - Integration: it is integration-tested against parquet files generated
>> by pyarrow==3, and the write path is covered by round-trip tests.
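>>
>> For context on the security point above: Rust can enforce "no unsafe"
>> crate-wide at compile time with a single lint attribute. Whether
>> parquet2 relies on exactly this attribute is my assumption; this is
>> only a minimal sketch of the mechanism:
>>
>>     // At the crate root (e.g. src/lib.rs or src/main.rs). With this
>>     // lint, any `unsafe { ... }` block anywhere in the crate is a
>>     // compile error, so "no use of unsafe" is checked by rustc
>>     // rather than by convention.
>>     #![forbid(unsafe_code)]
>>
>>     fn main() {
>>         // unsafe { ... }  // <- uncommenting this would not compile
>>     }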
>> 
>> The public API consists of plain functions and generic iterators. An
>> important design choice is a strict separation between IO-bound
>> operations (read and seek) and CPU-bound operations (decompress, decode,
>> deserialize). This gives consumers (read: datafusion, polars, etc.) the
>> choice of deciding how they want to parallelize the work among threads.
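>>
>> A self-contained sketch of that separation (the types and the
>> read/decode steps are stand-ins, not parquet2's actual API):
>>
>>     use std::sync::mpsc;
>>     use std::thread;
>>
>>     // Stand-in for a compressed parquet page as read from disk.
>>     struct CompressedPage(Vec<u8>);
>>
>>     fn main() {
>>         let (tx, rx) = mpsc::channel::<CompressedPage>();
>>
>>         // IO-bound side: a single thread reads pages sequentially
>>         // (stand-in for read + seek against a parquet file).
>>         let reader = thread::spawn(move || {
>>             for i in 0..3u8 {
>>                 tx.send(CompressedPage(vec![i; 8])).unwrap();
>>             }
>>         });
>>
>>         // CPU-bound side: the consumer decides how to parallelize;
>>         // here, one worker per page (stand-in for decompress +
>>         // decode + deserialize).
>>         let workers: Vec<_> = rx
>>             .into_iter()
>>             .map(|page| {
>>                 thread::spawn(move || {
>>                     page.0.iter().map(|b| *b as u64).sum::<u64>()
>>                 })
>>             })
>>             .collect();
>>
>>         reader.join().unwrap();
>>         for w in workers {
>>             let _decoded = w.join().unwrap();
>>         }
>>     }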
>> 
>> I investigated async, and AFAIU we first need to add support for it in
>> the thrift crate [3], as it currently does not have an API based on the
>> futures::AsyncRead and futures::AsyncSeek traits.
>> 
>> parquet2 is independent of the in-memory model; it just exposes an API
>> to read the parquet format according to the spec. It delegates to
>> consumers how to deserialize pages (I implemented this for arrow2 and
>> native rust), offering a toolkit to help them. Imo this is important
>> because the in-memory representation should be the one to decide how
>> to best convert a decompressed page to memory.
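>>
>> A minimal sketch of that delegation (names and types here are
>> illustrative, not parquet2's API): the library hands over a
>> decompressed page and the consumer supplies the conversion into its
>> own representation.
>>
>>     // Stand-in for a page after decompression.
>>     struct DecompressedPage {
>>         values: Vec<u8>,
>>     }
>>
>>     // The library stays model-independent: the consumer (arrow2,
>>     // native rust, ...) supplies the mapping to its own memory type.
>>     fn deserialize_with<T>(
>>         page: &DecompressedPage,
>>         to_memory: impl Fn(&[u8]) -> T,
>>     ) -> T {
>>         to_memory(&page.values)
>>     }
>>
>>     fn main() {
>>         let page = DecompressedPage { values: vec![1, 2, 3, 4] };
>>         // e.g. a "native rust" consumer that wants a Vec<u64>:
>>         let native: Vec<u64> =
>>             deserialize_with(&page, |b| {
>>                 b.iter().map(|x| *x as u64).collect()
>>             });
>>         assert_eq!(native, vec![1, 2, 3, 4]);
>>     }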
>> 
>> The development is happening on my own repo, but I was hoping to bring
>> it to the ASF (an experimental repo?), if you think that Apache Arrow
>> could be a place to host it (Apache Parquet being another option?).
>> 
>> [1] https://github.com/jorgecarleitao/parquet2
>> [2] https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=0
>> [3] https://issues.apache.org/jira/browse/THRIFT-4777
>> 
>> Best,
>> Jorge
>> 
