In principle, I don't see an issue with having a network of apache/arrow-* git repositories for Rust projects, so if the desire is to have a new GitHub repository for "revolution" crates (rewrites of more stable crates) versus the "evolution" crates, I think we could certainly do that.
On Sat, Apr 17, 2021 at 2:03 PM Evan Chan <e...@urbanlogiq.com> wrote:
>
> This sounds like really awesome work!
>
> If it is in its own repo, would that mean the current implementation in Arrow
> would just be left there?
> Good parquet support seems really important to have.
>
> Evan
>
> > On Apr 17, 2021, at 3:14 AM, Andrew Lamb <al...@influxdata.com> wrote:
> >
> > It sounds like exciting work Jorge -- Thank you for the update!
> >
> > I wonder what you hope to gain by bringing it to an ASF repo that you can't
> > get in your own repo?
> >
> > Perhaps you are ready to bring in other collaborators and wish to ensure
> > they have undergone the Apache IP clearance process?
> >
> > Andrew
> >
> >
> > On Fri, Apr 16, 2021 at 12:22 PM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> As briefly discussed in a recent email thread, I have been experimenting
> >> with re-writing the Rust parquet implementation. I have not advertised this
> >> much as I was very sceptical that this would work. I am now confident that
> >> it can, and thus would like to share more details.
> >>
> >> parquet2 [1] is a rewrite of the parquet crate taking security,
> >> performance, and parallelism as requirements.
> >>
> >> Here are the highlights so far:
> >>
> >> - Security: *no use of unsafe*. All invariants about memory and thread
> >> safety are proven by the Rust compiler (an audit to its 3 mandatory + 5
> >> optional compressors is still required). (compare e.g. ARROW-10920).
> >>
> >> - Performance: to the best of my benchmarking capabilities, *3-15x faster*
> >> than the parquet crate, both reading and writing to arrow. It has about the
> >> same performance as pyarrow/c++. These numbers correspond to a single plain
> >> page with 10% nulls and increase with increasing slot number / page size
> >> (which imo is a relevant unit of work). See [2] for plots, numbers and
> >> references to exact commits.
> >>
> >> - Features: it reads parquet optional primitive types, V1 and V2,
> >> dictionary- and non-dictionary pages, rep and def levels, and metadata. It
> >> reads 1-level nullable lists. It writes non-dictionary V1 pages with PLAIN
> >> and RLE encoding. No delta-encoding yet. No statistics yet.
> >>
> >> - Integration: it is integration-tested against parquet generated by
> >> pyarrow==3, and round trip tests for the write.
> >>
> >> The public API is just functions and iterators generics. An important
> >> design choice is that there is a strict separation between IO-bound
> >> operations (read and seek) and CPU-bound operations (decompress, decode,
> >> deserialize). This gives consumers (read datafusion, polars, etc.) the
> >> choice of deciding how they want to parallelize the work among threads.
> >>
> >> I investigated async and AFAIU we first need to add support to it on the
> >> thrift crate [3], as it currently does not have an API to use the
> >> futures::AsyncRead and futures::AsyncSeek traits.
> >>
> >> parquet2 is in-memory model -independent; it just exposes an API to read
> >> the parquet format according to the spec. It delegates to consumers how to
> >> deserialize the pages to it (I implemented it for arrow2 and native rust),
> >> offering a toolkit to help them. imo this is important because imo it
> >> should be the in-memory representation to decide how to best convert a
> >> decompressed page to memory.
> >>
> >> The development is happening on my own repo, but I was hoping to bring it
> >> to ASF (experimental repo?). if you think that Apache Arrow could be a
> >> place to host it (Apache Parquet is another option?).
> >>
> >> [1] https://github.com/jorgecarleitao/parquet2
> >> [2]
> >>
> >> https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=0
> >> [3] https://issues.apache.org/jira/browse/THRIFT-4777
> >>
> >> Best,
> >> Jorge
> >>
>