Hello,

You may want to interact with the Apache Iceberg community here. They are
currently discussing a similar thing:
https://lists.apache.org/thread.html/3bb4f89a0b37f474cf67915f91326fa845afa597bdd2463c98a2c8b9@%3Cdev.iceberg.apache.org%3E

I'm not involved in this, just reading both mailing lists and thought I'd
share it.
Cheers,
Uwe

On Wed, Sep 4, 2019, at 7:24 PM, Chao Sun wrote:
> Bumping this.
>
> We may have an upcoming use case for this as well. Want to know if anyone
> is actively working on this? I also heard that Dremio has internally
> implemented a performant Parquet-to-Arrow reader. Is there any plan to
> open-source it? That could save us a lot of work.
>
> Thanks,
> Chao
>
> On Fri, Aug 9, 2019 at 8:49 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
>
> > Hi:
> >
> > I'm working on the Rust part and expect to finish it soon. I'm also
> > interested in the Java version because we are trying to embed Arrow in
> > Spark to implement vectorized processing. Maybe we can work together.
> >
> > On Mon, Aug 5, 2019 at 1:50 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > Hi Anoop,
> > > I think a contribution would be welcome. There was a recent discussion
> > > thread on what would be expected from new "readers" for Arrow data in
> > > Java [1]. I think it's worth reading through, but my recollections of
> > > the highlights are:
> > > 1. A short design sketch in the JIRA that will track the work.
> > > 2. Off-heap data structures as much as possible.
> > > 3. An interface that allows predicate push-down, column projection, and
> > > specifying the batch sizes of reads. I think there is probably some
> > > interplay here between row-group size and batch size. It might be worth
> > > thinking about this up front and mentioning it in the design.
> > > 4. Performant (since we are going from columnar to columnar, it should
> > > be faster than Parquet-MR and on par with or better than Spark's
> > > implementation, which I believe also goes from columnar to columnar).
> > >
> > > Answers to specific questions below.
> > >
> > > Thanks,
> > > Micah
> > >
> > > > To help me get started, are there any pointers on how the C++ or Rust
> > > > implementations currently read Parquet into Arrow?
> > >
> > > I'm not sure about the Rust code, but the C++ code is located at [2]. It
> > > has been undergoing some recent refactoring (and I think Wes might have
> > > one or two changes still to make). It doesn't yet fully support nested
> > > data types (e.g. structs).
> > >
> > > > Are they reading Parquet row-by-row and building Arrow batches, or are
> > > > there better ways of implementing this?
> > >
> > > I believe the implementations should be reading a row group at a time,
> > > column by column. Spark potentially has an implementation that already
> > > does this.
> > >
> > > [1] https://lists.apache.org/thread.html/b096528600e66c17af9498c151352f12944ead2fd218a0257fdd4f70@%3Cdev.arrow.apache.org%3E
> > > [2] https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow
> > >
> > > On Sun, Aug 4, 2019 at 2:52 PM Anoop Johnson <anoop.k.john...@gmail.com> wrote:
> > >
> > > > Thanks for the response, Micah. I could implement this and contribute
> > > > to Arrow Java. To help me get started, are there any pointers on how
> > > > the C++ or Rust implementations currently read Parquet into Arrow?
> > > > Are they reading Parquet row-by-row and building Arrow batches, or
> > > > are there better ways of implementing this?
> > > >
> > > > On Tue, Jul 30, 2019 at 1:56 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > >
> > > >> Hi Anoop,
> > > >> There isn't currently anything in the Arrow Java library that does
> > > >> this. It is something that I think we want to add at some point.
> > > >> Dremio [1] has some Parquet-related code, but I haven't looked at it
> > > >> to understand how easy it is to use as a standalone library and
> > > >> whether it supports predicate push-down/column selection.
> > > >>
> > > >> Thanks,
> > > >> Micah
> > > >>
> > > >> [1] https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
> > > >>
> > > >> On Sun, Jul 28, 2019 at 2:08 PM Anoop Johnson <anoop.k.john...@gmail.com> wrote:
> > > >>
> > > >> > Arrow newbie here. What is the recommended way to convert Parquet
> > > >> > data into Arrow, preferably doing predicate/column pushdown?
> > > >> >
> > > >> > One can implement this as custom code using the Parquet API and
> > > >> > re-encode it in Arrow using the Arrow APIs, but is this supported by
> > > >> > Arrow out of the box?
> > > >> >
> > > >> > Thanks,
> > > >> > Anoop
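
A minimal sketch of the row-group-at-a-time, column-projected read that Micah
describes, using the C++ parquet::arrow reader at [2]. This is illustrative
only: the file name and column indices are placeholders, and the exact
signatures may vary across Arrow releases (this roughly matches the 0.14-era
API).

    // Minimal sketch only: assumes the ~0.14-era parquet::arrow C++ API;
    // "example.parquet" and the column indices {0, 2} are placeholders.
    #include <memory>
    #include <vector>

    #include <arrow/io/file.h>
    #include <arrow/memory_pool.h>
    #include <arrow/table.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/exception.h>

    int main() {
      // Open the Parquet file through Arrow's file abstraction.
      std::shared_ptr<arrow::io::ReadableFile> infile;
      PARQUET_THROW_NOT_OK(
          arrow::io::ReadableFile::Open("example.parquet", &infile));

      // Wrap it in a parquet::arrow::FileReader.
      std::unique_ptr<parquet::arrow::FileReader> reader;
      PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
          infile, arrow::default_memory_pool(), &reader));

      // Read one row group at a time, projecting only columns 0 and 2,
      // directly into an Arrow table (columnar -> columnar, no row pivot).
      std::vector<int> columns = {0, 2};
      for (int rg = 0; rg < reader->num_row_groups(); ++rg) {
        std::shared_ptr<arrow::Table> table;
        PARQUET_THROW_NOT_OK(reader->ReadRowGroup(rg, columns, &table));
        // ... hand `table` to downstream vectorized processing ...
      }
      return 0;
    }

Each ReadRowGroup call materializes one Arrow table per row group, which keeps
the columnar-to-columnar path and lets the caller control batch sizing at
row-group granularity.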