In principle, I don't see an issue with having a network of
apache/arrow-* git repositories for Rust projects, so if the desire is
to have a new GitHub repository for "revolution" crates (rewrites of
more stable crates) versus the "evolution" crates, I think we could
certainly do that.

On Sat, Apr 17, 2021 at 2:03 PM Evan Chan <e...@urbanlogiq.com> wrote:
>
> This sounds like really awesome work!
>
> If it is in its own repo, would that mean the current implementation in Arrow 
> would just be left there?
> Good parquet support seems really important to have.
>
> Evan
>
> > On Apr 17, 2021, at 3:14 AM, Andrew Lamb <al...@influxdata.com> wrote:
> >
> > It sounds like exciting work Jorge -- Thank you for the update!
> >
> > I wonder what you hope to gain by bringing it to an ASF repo that you can't
> > get in your own repo?
> >
> > Perhaps you are ready to bring in other collaborators and wish to ensure
> > they have undergone the Apache IP clearance process?
> >
> > Andrew
> >
> >
> > On Fri, Apr 16, 2021 at 12:22 PM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> As briefly discussed in a recent email thread, I have been experimenting
> >> with re-writing the Rust parquet implementation. I have not advertised this
> >> much as I was very sceptical that this would work. I am now confident that
> >> it can, and thus would like to share more details.
> >>
> >> parquet2 [1] is a rewrite of the parquet crate taking security,
> >> performance, and parallelism as requirements.
> >>
> >> Here are the highlights so far:
> >>
> >> - Security: *no use of unsafe*. All invariants about memory and thread
> >> safety are proven by the Rust compiler (an audit of its 3 mandatory + 5
> >> optional compressors is still required). Compare e.g. ARROW-10920.
> >>
> >> - Performance: to the best of my benchmarking capabilities, *3-15x faster*
> >> than the parquet crate, both reading and writing to arrow. It has about the
> >> same performance as pyarrow/c++. These numbers correspond to a single plain
> >> page with 10% nulls and increase with increasing slot number / page size
> >> (which imo is a relevant unit of work). See [2] for plots, numbers and
> >> references to exact commits.
> >>
> >> - Features: it reads parquet optional primitive types, V1 and V2,
> >> dictionary- and non-dictionary pages, rep and def levels, and metadata. It
> >> reads 1-level nullable lists. It writes non-dictionary V1 pages with PLAIN
> >> and RLE encoding. No delta-encoding yet. No statistics yet.
> >>
> >> - Integration: it is integration-tested against parquet generated by
> >> pyarrow==3, and has round-trip tests for the write path.
> >>
> >> The public API is just generic functions and iterators. An important
> >> design choice is that there is a strict separation between IO-bound
> >> operations (read and seek) and CPU-bound operations (decompress, decode,
> >> deserialize). This gives consumers (read datafusion, polars, etc.) the
> >> choice of deciding how they want to parallelize the work among threads.
> >>
> >> I investigated async and AFAIU we first need to add support for it to the
> >> thrift crate [3], as it currently does not have an API to use the
> >> futures::AsyncRead and futures::AsyncSeek traits.
> >>
> >> parquet2 is independent of the in-memory model; it just exposes an API to
> >> read the parquet format according to the spec. It delegates to consumers
> >> how to deserialize the pages into their model (I implemented this for arrow2
> >> and native Rust), offering a toolkit to help them. imo this is important
> >> because it should be the in-memory representation that decides how best to
> >> convert a decompressed page to memory.
> >>
> >> The development is happening on my own repo, but I was hoping to bring it
> >> to the ASF (experimental repo?), if you think that Apache Arrow could be a
> >> place to host it (Apache Parquet is another option?).
> >>
> >> [1] https://github.com/jorgecarleitao/parquet2
> >> [2] https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=0
> >> [3] https://issues.apache.org/jira/browse/THRIFT-4777
> >>
> >> Best,
> >> Jorge
> >>
>
