Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Wes McKinney Tue, 14 Apr 2020 07:49:05 -0700

hi Micah,

I'm glad that we have the write side of nested completed for 0.17.0.


As far as completing the read side and then implementing sufficient
testing to exercise corner cases in end-to-end reads/writes, do you
anticipate being able to work on this in the next 4-6 weeks (obviously
the state of the world has affected everyone's availability /
bandwidth)? I ask because someone from my team (or me also) may be
able to get involved and help this move along. It'd be great to have
this 100% completed and checked off our list for the next release
(i.e. 0.18.0 or 1.0.0 depending on whether the Java/C++ integration
tests get completed also).

thanks
Wes

On Wed, Feb 5, 2020 at 12:12 AM Micah Kornfield <[email protected]> wrote:
>>
>> Glad to hear about the progress. As I mentioned on #2, what do you
>> think about setting up a feature branch for you to merge PRs into?
>> Then the branch can be iterated on and we can merge it back when it's
>> feature complete and does not have perf regressions for the flat
>> read/write path.
>>
> I'd like to avoid a separate branch if possible. I'm willing to close the
> open PR until I'm sure it is needed, but I'm hoping that keeping PRs as small
> and focused as possible, with performance testing along the way, will be a
> better reviewer and developer experience here.
>
>> The earliest I'd have time to work on this myself would likely be
>> sometime in March. Others are welcome to jump in as well (and it'd be
>> great to increase the overall level of knowledge of the Parquet
>> codebase)
>
> Hopefully Igor can help out; otherwise I'll take up the read path after I
> finish the write path.
>
> -Micah
>
> On Tue, Feb 4, 2020 at 3:31 PM Wes McKinney <[email protected]> wrote:
>>
>> hi Micah
>>
>> On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield <[email protected]> wrote:
>> >
>> > Just to give an update.  I've been a little bit delayed, but my progress is
>> > as follows:
>> > 1.  Had 1 PR merged that will exercise basic end-to-end tests.
>> > 2.  Have another PR open that adds a configuration option in C++ to
>> > determine which algorithm version to use for reading/writing: the existing
>> > version or the new version that supports complex nested arrays.  I think a
>> > large amount of code will be reused/delegated to, but I will err on the side
>> > of not touching the existing code/algorithms so that any errors in the
>> > implementation or performance regressions can hopefully be mitigated at
>> > runtime.  I expect that in later releases (once the code has "baked") the
>> > option will become a no-op.
>>
>> Glad to hear about the progress. As I mentioned on #2, what do you
>> think about setting up a feature branch for you to merge PRs into?
>> Then the branch can be iterated on and we can merge it back when it's
>> feature complete and does not have perf regressions for the flat
>> read/write path.
>>
>> > 3.  Started coding the write path.
>> >
>> > Which leaves:
>> > 1.  Finishing the write path (I estimate 2-3 weeks to be code complete).
>> > 2.  Implementing the read path.
>>
>> The earliest I'd have time to work on this myself would likely be
>> sometime in March. Others are welcome to jump in as well (and it'd be
>> great to increase the overall level of knowledge of the Parquet
>> codebase)
>>
>> > Again, I'm happy to collaborate if people have bandwidth and want to
>> > contribute.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <[email protected]> wrote:
>> >
>> > > Hi Wes,
>> > > I'm still interested in doing the work, but I don't want to hold anybody
>> > > up if they have bandwidth.
>> > >
>> > > In order to actually make progress on this, my plan will be to:
>> > > 1.  Help with the current Java review backlog through early next week or
>> > > so (this has been taking the majority of my time allocated for Arrow
>> > > contributions for the last 6 months or so).
>> > > 2.  Shift all my attention to trying to get this done (this means no
>> > > reviews other than closing out existing ones that I've started until it
>> > > is done).  Hopefully, other Java committers can help shrink the backlog
>> > > further (Jacques, thanks for your recent efforts here).
>> > >
>> > > Thanks,
>> > > Micah
>> > >
>> > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <[email protected]> wrote:
>> > >
>> > >> hi folks,
>> > >>
>> > >> I think we have reached a point where the incomplete C++ Parquet
>> > >> nested data assembly/disassembly is harming the value of several
>> > >> other parts of the project, for example the Datasets API. As another
>> > >> example, it's possible to ingest nested data from JSON but not write
>> > >> it to Parquet in general.
>> > >>
>> > >> Implementing the nested data read and write path completely is a
>> > >> difficult project requiring at least several weeks of dedicated work,
>> > >> so it's not so surprising that it hasn't been accomplished yet. I know
>> > >> that several people have expressed interest in working on it, but I
>> > >> would like to see if anyone would be able to volunteer a commitment of
>> > >> time and guess on a rough timeline when this work could be done. It
>> > >> seems to me if this slips beyond 2020 it will significantly diminish the
>> > >> value being created by other parts of the project.
>> > >>
>> > >> Since I'm pretty familiar with all the Parquet code I'm one candidate
>> > >> person to take on this project (and I can dedicate the time, but it
>> > >> would come at the expense of other projects where I can also be
>> > >> useful). But Micah and others expressed interest in working on it, so
>> > >> I wanted to have a discussion about it to see what others think.
>> > >>
>> > >> Thanks
>> > >> Wes
>> > >>
>> > >
