If it's not in the main repository, could these be linked from the docs? It would increase discoverability.
On Sun, 18 Apr 2021, 6:43 am Benjamin Blodgett, <benjaminblodg...@gmail.com> wrote:
> That sounds like a great way to frame that and solve that issue!
>
> Sent from my iPhone
>
> > On Apr 17, 2021, at 3:01 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > In principle, I don't see an issue with having a network of
> > apache/arrow-* git repositories for Rust projects, so if the desire is
> > to have a new GitHub repository for "revolution" crates (rewrites of
> > more stable crates) versus the "evolution" crates, I think we could
> > certainly do that.
> >
> >> On Sat, Apr 17, 2021 at 2:03 PM Evan Chan <e...@urbanlogiq.com> wrote:
> >>
> >> This sounds like really awesome work!
> >>
> >> If it is in its own repo, would that mean the current implementation in Arrow would just be left there?
> >> Good parquet support seems really important to have.
> >>
> >> Evan
> >>
> >>> On Apr 17, 2021, at 3:14 AM, Andrew Lamb <al...@influxdata.com> wrote:
> >>>
> >>> It sounds like exciting work Jorge -- thank you for the update!
> >>>
> >>> I wonder what you hope to gain by bringing it to an ASF repo that you can't get in your own repo?
> >>>
> >>> Perhaps you are ready to bring in other collaborators and wish to ensure they have undergone the Apache IP clearance process?
> >>>
> >>> Andrew
> >>>
> >>> On Fri, Apr 16, 2021 at 12:22 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> As briefly discussed in a recent email thread, I have been experimenting with rewriting the Rust parquet implementation. I have not advertised this much, as I was very sceptical that it would work. I am now confident that it can, and thus would like to share more details.
> >>>>
> >>>> parquet2 [1] is a rewrite of the parquet crate taking security, performance, and parallelism as requirements.
> >>>>
> >>>> Here are the highlights so far:
> >>>>
> >>>> - Security: *no use of unsafe*. All invariants about memory and thread safety are proven by the Rust compiler (an audit of its 3 mandatory + 5 optional compressors is still required). (Compare e.g. ARROW-10920.)
> >>>>
> >>>> - Performance: to the best of my benchmarking capabilities, *3-15x faster* than the parquet crate, both reading and writing to arrow. It has about the same performance as pyarrow/C++. These numbers correspond to a single plain page with 10% nulls and increase with increasing slot number / page size (which imo is a relevant unit of work). See [2] for plots, numbers, and references to exact commits.
> >>>>
> >>>> - Features: it reads parquet optional primitive types, V1 and V2, dictionary and non-dictionary pages, rep and def levels, and metadata. It reads 1-level nullable lists. It writes non-dictionary V1 pages with PLAIN and RLE encoding. No delta-encoding yet. No statistics yet.
> >>>>
> >>>> - Integration: it is integration-tested against parquet files generated by pyarrow==3, with round-trip tests for the write path.
> >>>>
> >>>> The public API is just functions and iterator generics. An important design choice is that there is a strict separation between IO-bound operations (read and seek) and CPU-bound operations (decompress, decode, deserialize). This gives consumers (read DataFusion, Polars, etc.) the choice of deciding how they want to parallelize the work among threads.
> >>>>
> >>>> I investigated async, and AFAIU we first need to add support for it in the thrift crate [3], as it currently does not have an API to use the futures::AsyncRead and futures::AsyncSeek traits.
> >>>>
> >>>> parquet2 is independent of the in-memory model; it just exposes an API to read the parquet format according to the spec. It delegates to consumers how to deserialize the pages (I implemented it for arrow2 and native Rust), offering a toolkit to help them. Imo this is important because it should be up to the in-memory representation to decide how to best convert a decompressed page to memory.
> >>>>
> >>>> The development is happening on my own repo, but I was hoping to bring it to ASF (an experimental repo?), if you think that Apache Arrow could be a place to host it (Apache Parquet is another option?).
> >>>>
> >>>> [1] https://github.com/jorgecarleitao/parquet2
> >>>> [2] https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=0
> >>>> [3] https://issues.apache.org/jira/browse/THRIFT-4777
> >>>>
> >>>> Best,
> >>>> Jorge
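To make the design Jorge describes more concrete (a strict separation between IO-bound reads and CPU-bound decompression/decoding, with the consumer choosing the parallelism strategy), here is a minimal Rust sketch. Every type and function name below is a hypothetical placeholder chosen for illustration; it is not the actual parquet2 API.

// A minimal sketch (all names hypothetical, not the parquet2 API) of the
// separation described above: IO-bound work yields still-compressed pages;
// CPU-bound work (decompress + decode) is a pure function, so the consumer
// decides where and how to parallelize it.

use std::io::{Read, Seek};

/// A raw data page as read from the file (IO-bound side). Hypothetical type.
struct CompressedPage {
    bytes: Vec<u8>,
}

/// A decompressed, decoded page (CPU-bound side). Hypothetical type.
struct DecodedPage {
    values: Vec<i64>,
}

/// IO-bound: iterate over compressed pages from any `Read + Seek` source.
/// Placeholder body; a real reader would walk the column-chunk metadata.
fn read_pages<R: Read + Seek>(_reader: R) -> impl Iterator<Item = CompressedPage> {
    std::iter::empty()
}

/// CPU-bound: pure function with no IO, safe to move onto a worker thread.
/// Placeholder body; a real decoder would decompress and decode per the page encoding.
fn decode(page: CompressedPage) -> DecodedPage {
    DecodedPage {
        values: page.bytes.iter().map(|&b| b as i64).collect(),
    }
}

fn main() -> std::io::Result<()> {
    let file = std::fs::File::open("example.parquet")?;
    // The consumer picks the parallelism strategy: here sequential, but each
    // `CompressedPage` could equally be handed to a thread pool before `decode`.
    for page in read_pages(file) {
        let decoded = decode(page);
        println!("decoded {} values", decoded.values.len());
    }
    Ok(())
}

With this shape, a consumer such as DataFusion or Polars could keep the page iterator on an IO thread and fan the decode calls out to a worker pool of its choosing.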