Hi, I would love to help with this issue. I'm aware that this is a huge task for a first contribution to Arrow, but I feel that I could help with the read path. Reading Parquet seems like an extremely complex task: both Hive [0] and Spark [1] attempted a "vectorized" implementation, and both stopped short of supporting complex types. I wanted to at least give it a try and find out where the challenge lies.
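To make the challenge concrete, here is a toy Python sketch (my own illustration, not Arrow or Parquet code) of Dremel-style record shredding for a `list<list<int>>` column, under a deliberately simplified no-null encoding: definition level 0 means the outer list is empty, 1 means an inner list is empty, and 2 means a leaf value is present, while the repetition level marks the nesting depth at which repetition resumes. The read path has to invert exactly this kind of mapping, which is where much of the complexity lives.

```python
def shred(rows):
    """Flatten list<list<int>> rows into (value, rep_level, def_level) triples.

    Simplified no-null encoding: def 0 = empty outer list, 1 = empty inner
    list, 2 = value present; rep 0 = new record, 1 = new inner list within
    the record, 2 = another value within the current inner list.
    """
    out = []
    for row in rows:
        if not row:
            out.append((None, 0, 0))          # empty outer list
            continue
        for i, inner in enumerate(row):
            rep_outer = 0 if i == 0 else 1    # first inner list starts the record
            if not inner:
                out.append((None, rep_outer, 1))  # empty inner list
                continue
            for j, value in enumerate(inner):
                rep = rep_outer if j == 0 else 2
                out.append((value, rep, 2))
    return out

# Example: three records, including an empty outer and an empty inner list.
triples = shred([[[1, 2], [3]], [], [[], [4]]])
# → [(1, 0, 2), (2, 2, 2), (3, 1, 2), (None, 0, 0), (None, 0, 1), (4, 1, 2)]
```

The real Parquet encoding additionally threads nullability through the definition levels (each optional level adds one), which is part of why complex types are hard to vectorize.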
Since you are all much more familiar with the current code base, I could use some starting tips so I don't fall into common pitfalls.

[0] https://issues.apache.org/jira/browse/HIVE-18576
[1] https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45

On 2020/02/03 06:01:25, Micah Kornfield <e...@gmail.com> wrote:
> Just to give an update. I've been a little bit delayed, but my progress is
> as follows:
> 1. Had one PR merged that will exercise basic end-to-end tests.
> 2. Have another PR open that adds a configuration option in C++ to
> determine which algorithm version to use for reading/writing: the existing
> version or the new version supporting complex nested arrays. I think a
> large amount of code will be reused/delegated to, but I will err on the
> side of not touching the existing code/algorithms, so that any errors in
> the implementation or performance regressions can hopefully be mitigated
> at runtime. I expect that in later releases (once the code has "baked")
> the option will become a no-op.
> 3. Started coding the write path.
>
> Which leaves:
> 1. Finishing the write path (I estimate 2-3 weeks to be code complete).
> 2. Implementing the read path.
>
> Again, I'm happy to collaborate if people have bandwidth and want to
> contribute.
>
> Thanks,
> Micah
>
> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com> wrote:
>
> > Hi Wes,
> > I'm still interested in doing the work, but I don't want to hold
> > anybody up if they have bandwidth.
> >
> > In order to actually make progress on this, my plan is to:
> > 1. Help with the current Java review backlog through early next week
> > or so (this has been taking the majority of my time allocated for
> > Arrow contributions for the last six months or so).
> > 2. Shift all my attention to trying to get this done (this means no
> > reviews other than closing out existing ones that I've started, until
> > it is done). Hopefully, other Java committers can help shrink the
> > backlog further (Jacques, thanks for your recent efforts here).
> >
> > Thanks,
> > Micah
> >
> > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > > hi folks,
> > >
> > > I think we have reached a point where the incomplete C++ Parquet
> > > nested data assembly/disassembly is harming the value of several
> > > other parts of the project, for example the Datasets API. As another
> > > example, it's possible to ingest nested data from JSON but not, in
> > > general, to write it to Parquet.
> > >
> > > Implementing the nested data read and write paths completely is a
> > > difficult project requiring at least several weeks of dedicated
> > > work, so it's not so surprising that it hasn't been accomplished
> > > yet. I know that several people have expressed interest in working
> > > on it, but I would like to see if anyone would be able to volunteer
> > > a commitment of time and guess at a rough timeline for when this
> > > work could be done. It seems to me that if this slips beyond 2020 it
> > > will significantly diminish the value being created by other parts
> > > of the project.
> > >
> > > Since I'm pretty familiar with all the Parquet code, I'm one
> > > candidate to take on this project (and I can dedicate the time, but
> > > it would come at the expense of other projects where I can also be
> > > useful). But Micah and others have expressed interest in working on
> > > it, so I wanted to have a discussion to see what others think.
> > >
> > > Thanks
> > > Wes
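As a companion illustration of what the read path discussed above has to do, here is a toy Python sketch (again my own, not Arrow code) that reassembles `list<list<int>>` rows from (value, repetition level, definition level) triples, under a simplified no-null encoding: definition level 0 = empty outer list, 1 = empty inner list, 2 = value present; repetition level 0 starts a new record, 1 starts a new inner list, 2 appends to the current inner list.

```python
def assemble(triples):
    """Rebuild list<list<int>> rows from (value, rep_level, def_level) triples.

    Simplified no-null encoding: def 0 = empty outer list, 1 = empty inner
    list, 2 = value present; rep 0 starts a new record, rep <= 1 starts a
    new inner list, rep 2 appends to the current inner list.
    """
    rows = []
    for value, rep, defn in triples:
        if rep == 0:
            rows.append([])              # repetition level 0: new record
        if defn == 0:
            continue                     # outer list is empty
        if rep <= 1:
            rows[-1].append([])          # open a new inner list
        if defn == 2:
            rows[-1][-1].append(value)   # leaf value is present
    return rows

rows = assemble([(1, 0, 2), (2, 2, 2), (3, 1, 2),
                 (None, 0, 0), (None, 0, 1), (4, 1, 2)])
# → [[[1, 2], [3]], [], [[], [4]]]
```

Vectorizing this reassembly (rather than looping triple by triple) while also handling nullability at every level is, as far as I understand it, the core difficulty the thread is describing.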