Hello,

You may want to interact with the Apache Iceberg community here. They are
currently discussing a similar thing:
https://lists.apache.org/thread.html/3bb4f89a0b37f474cf67915f91326fa845afa597bdd2463c98a2c8b9@%3Cdev.iceberg.apache.org%3E

I'm not involved in this, just reading both mailing lists and thought I'd
share it.
Cheers,
Uwe

On Wed, Sep 4, 2019, at 7:24 PM, Chao Sun wrote:
> Bumping this.
>
> We may have an upcoming use case for this as well. Want to know if anyone
> is actively working on this? I also heard that Dremio has internally
> implemented a performant Parquet-to-Arrow reader. Is there any plan to
> open-source it? That could save us a lot of work.
>
> Thanks,
> Chao
>
> On Fri, Aug 9, 2019 at 8:49 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
>
> > Hi:
> >
> > I'm working on the Rust part and expect to finish it soon. I'm also
> > interested in the Java version because we are trying to embed Arrow in
> > Spark to implement vectorized processing. Maybe we can work together.
> >
> > On Mon, Aug 5, 2019 at 1:50 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > Hi Anoop,
> > > I think a contribution would be welcome. There was a recent discussion
> > > thread on what would be expected from new "readers" for Arrow data in
> > > Java [1]. I think it's worth reading through, but my recollections of
> > > the highlights are:
> > > 1. A short design sketch in the JIRA that will track the work.
> > > 2. Off-heap data structures as much as possible.
> > > 3. An interface that allows predicate push-down, column projection, and
> > > specifying the batch sizes of reads. I think there is probably some
> > > interplay here between row-group size and batch size. It might be worth
> > > thinking about this up front and mentioning it in the design.
> > > 4. Performant (since we are going from columnar to columnar, it should
> > > be faster than Parquet-MR and on par with or better than Spark's
> > > implementation, which I believe also goes from columnar to columnar).
> > >
> > > Answers to specific questions below.
> > >
> > > Thanks,
> > > Micah
> > >
> > > > To help me get started, are there any pointers on how the C++ or Rust
> > > > implementations currently read Parquet into Arrow?
> > >
> > > I'm not sure about the Rust code, but the C++ code is located at [2]. It
> > > has been undergoing some recent refactoring (and I think Wes might have
> > > one or two changes still to make). It doesn't yet fully support nested
> > > data types (e.g. structs).
> > >
> > > > Are they reading Parquet row-by-row and building Arrow batches, or are
> > > > there better ways of implementing this?
> > >
> > > I believe the implementations should be reading a row group at a time,
> > > column by column. Spark potentially has an implementation that already
> > > does this.
> > >
> > > [1] https://lists.apache.org/thread.html/b096528600e66c17af9498c151352f12944ead2fd218a0257fdd4f70@%3Cdev.arrow.apache.org%3E
> > > [2] https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow
> > >
> > > On Sun, Aug 4, 2019 at 2:52 PM Anoop Johnson <anoop.k.john...@gmail.com> wrote:
> > >
> > > > Thanks for the response, Micah. I could implement this and contribute
> > > > to Arrow Java. To help me get started, are there any pointers on how
> > > > the C++ or Rust implementations currently read Parquet into Arrow?
> > > > Are they reading Parquet row-by-row and building Arrow batches, or
> > > > are there better ways of implementing this?
> > > >
> > > > On Tue, Jul 30, 2019 at 1:56 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > >
> > > >> Hi Anoop,
> > > >> There isn't currently anything in the Arrow Java library that does
> > > >> this. It is something that I think we want to add at some point.
> > > >> Dremio [1] has some Parquet-related code, but I haven't looked at it
> > > >> to understand how easy it is to use as a standalone library and
> > > >> whether it supports predicate push-down/column selection.
> > > >>
> > > >> Thanks,
> > > >> Micah
> > > >>
> > > >> [1] https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
> > > >>
> > > >> On Sun, Jul 28, 2019 at 2:08 PM Anoop Johnson <anoop.k.john...@gmail.com> wrote:
> > > >>
> > > >> > Arrow newbie here. What is the recommended way to convert Parquet
> > > >> > data into Arrow, preferably doing predicate/column pushdown?
> > > >> >
> > > >> > One can implement this as custom code using the Parquet API and
> > > >> > re-encode it in Arrow using the Arrow APIs, but is this supported by
> > > >> > Arrow out of the box?
> > > >> >
> > > >> > Thanks,
> > > >> > Anoop
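
A minimal sketch of the row-group-at-a-time, column-projected read that Micah
describes, using the C++ parquet::arrow reader at [2]. This is illustrative
only: the file name and column indices are placeholders, and the exact
signatures may vary across Arrow releases (this roughly matches the 0.14-era
API).

    // Minimal sketch only: assumes the ~0.14-era parquet::arrow C++ API;
    // "example.parquet" and the column indices {0, 2} are placeholders.
    #include <memory>
    #include <vector>

    #include <arrow/io/file.h>
    #include <arrow/memory_pool.h>
    #include <arrow/table.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/exception.h>

    int main() {
      // Open the Parquet file through Arrow's file abstraction.
      std::shared_ptr<arrow::io::ReadableFile> infile;
      PARQUET_THROW_NOT_OK(
          arrow::io::ReadableFile::Open("example.parquet", &infile));

      // Wrap it in a parquet::arrow::FileReader.
      std::unique_ptr<parquet::arrow::FileReader> reader;
      PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
          infile, arrow::default_memory_pool(), &reader));

      // Read one row group at a time, projecting only columns 0 and 2,
      // directly into an Arrow table (columnar -> columnar, no row pivot).
      std::vector<int> columns = {0, 2};
      for (int rg = 0; rg < reader->num_row_groups(); ++rg) {
        std::shared_ptr<arrow::Table> table;
        PARQUET_THROW_NOT_OK(reader->ReadRowGroup(rg, columns, &table));
        // ... hand `table` to downstream vectorized processing ...
      }
      return 0;
    }

Each ReadRowGroup call materializes one Arrow table per row group, which keeps
the columnar-to-columnar path and lets the caller control batch sizing at
row-group granularity.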