If it's not in the main repository, could these be linked from the docs? It would increase discoverability.
On Sun, 18 Apr 2021, 6:43 am Benjamin Blodgett, <benjaminblodg...@gmail.com> wrote:
> That sounds like a great way to frame that and solve that issue!
>
> Sent from my iPhone
>
> > On Apr 17, 2021, at 3:01 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > In principle, I don't see an issue with having a network of
> > apache/arrow-* git repositories for Rust projects, so if the desire is
> > to have a new GitHub repository for "revolution" crates (rewrites of
> > more stable crates) versus the "evolution" crates, I think we could
> > certainly do that.
> >
> >> On Sat, Apr 17, 2021 at 2:03 PM Evan Chan <e...@urbanlogiq.com> wrote:
> >>
> >> This sounds like really awesome work!
> >>
> >> If it is in its own repo, would that mean the current implementation in Arrow would just be left there?
> >> Good parquet support seems really important to have.
> >>
> >> Evan
> >>
> >>> On Apr 17, 2021, at 3:14 AM, Andrew Lamb <al...@influxdata.com> wrote:
> >>>
> >>> It sounds like exciting work Jorge -- thank you for the update!
> >>>
> >>> I wonder what you hope to gain by bringing it to an ASF repo that you can't get in your own repo?
> >>>
> >>> Perhaps you are ready to bring in other collaborators and wish to ensure they have undergone the Apache IP clearance process?
> >>>
> >>> Andrew
> >>>
> >>> On Fri, Apr 16, 2021 at 12:22 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> As briefly discussed in a recent email thread, I have been experimenting with rewriting the Rust parquet implementation. I have not advertised this much, as I was very sceptical that it would work. I am now confident that it can, and thus would like to share more details.
> >>>>
> >>>> parquet2 [1] is a rewrite of the parquet crate taking security, performance, and parallelism as requirements.
> >>>>
> >>>> Here are the highlights so far:
> >>>>
> >>>> - Security: *no use of unsafe*. All invariants about memory and thread safety are proven by the Rust compiler (an audit of its 3 mandatory + 5 optional compressors is still required). (Compare e.g. ARROW-10920.)
> >>>>
> >>>> - Performance: to the best of my benchmarking capabilities, *3-15x faster* than the parquet crate, both reading and writing to arrow. It has about the same performance as pyarrow/C++. These numbers correspond to a single plain page with 10% nulls and increase with increasing slot number / page size (which imo is a relevant unit of work). See [2] for plots, numbers, and references to exact commits.
> >>>>
> >>>> - Features: it reads parquet optional primitive types, V1 and V2, dictionary and non-dictionary pages, rep and def levels, and metadata. It reads 1-level nullable lists. It writes non-dictionary V1 pages with PLAIN and RLE encoding. No delta-encoding yet. No statistics yet.
> >>>>
> >>>> - Integration: it is integration-tested against parquet files generated by pyarrow==3, with round-trip tests for the write path.
> >>>>
> >>>> The public API is just functions and iterator generics. An important design choice is that there is a strict separation between IO-bound operations (read and seek) and CPU-bound operations (decompress, decode, deserialize). This gives consumers (read DataFusion, Polars, etc.) the choice of deciding how they want to parallelize the work among threads.
> >>>>
> >>>> I investigated async, and AFAIU we first need to add support for it in the thrift crate [3], as it currently does not have an API to use the futures::AsyncRead and futures::AsyncSeek traits.
> >>>>
> >>>> parquet2 is independent of the in-memory model; it just exposes an API to read the parquet format according to the spec. It delegates to consumers how to deserialize the pages (I implemented it for arrow2 and native Rust), offering a toolkit to help them. Imo this is important because it should be up to the in-memory representation to decide how to best convert a decompressed page to memory.
> >>>>
> >>>> The development is happening on my own repo, but I was hoping to bring it to ASF (an experimental repo?), if you think that Apache Arrow could be a place to host it (Apache Parquet is another option?).
> >>>>
> >>>> [1] https://github.com/jorgecarleitao/parquet2
> >>>> [2] https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=0
> >>>> [3] https://issues.apache.org/jira/browse/THRIFT-4777
> >>>>
> >>>> Best,
> >>>> Jorge
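To make the design Jorge describes more concrete (a strict separation between IO-bound reads and CPU-bound decompression/decoding, with the consumer choosing the parallelism strategy), here is a minimal Rust sketch. Every type and function name below is a hypothetical placeholder chosen for illustration; it is not the actual parquet2 API.

// A minimal sketch (all names hypothetical, not the parquet2 API) of the
// separation described above: IO-bound work yields still-compressed pages;
// CPU-bound work (decompress + decode) is a pure function, so the consumer
// decides where and how to parallelize it.

use std::io::{Read, Seek};

/// A raw data page as read from the file (IO-bound side). Hypothetical type.
struct CompressedPage {
    bytes: Vec<u8>,
}

/// A decompressed, decoded page (CPU-bound side). Hypothetical type.
struct DecodedPage {
    values: Vec<i64>,
}

/// IO-bound: iterate over compressed pages from any `Read + Seek` source.
/// Placeholder body; a real reader would walk the column-chunk metadata.
fn read_pages<R: Read + Seek>(_reader: R) -> impl Iterator<Item = CompressedPage> {
    std::iter::empty()
}

/// CPU-bound: pure function with no IO, safe to move onto a worker thread.
/// Placeholder body; a real decoder would decompress and decode per the page encoding.
fn decode(page: CompressedPage) -> DecodedPage {
    DecodedPage {
        values: page.bytes.iter().map(|&b| b as i64).collect(),
    }
}

fn main() -> std::io::Result<()> {
    let file = std::fs::File::open("example.parquet")?;
    // The consumer picks the parallelism strategy: here sequential, but each
    // `CompressedPage` could equally be handed to a thread pool before `decode`.
    for page in read_pages(file) {
        let decoded = decode(page);
        println!("decoded {} values", decoded.values.len());
    }
    Ok(())
}

With this shape, a consumer such as DataFusion or Polars could keep the page iterator on an IO thread and fan the decode calls out to a worker pool of its choosing.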