Hi, I would love to help with this issue. I'm aware that this is a huge task for a first contribution to Arrow, but I feel that I could help with the read path. Reading Parquet seems like an extremely complex task: both Hive [0] and Spark [1] attempted a "vectorized" implementation, and both stopped short of supporting complex types. I wanted to at least give it a try and find out where the challenge lies.
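To make the challenge concrete, here is a toy Python sketch (my own illustration, not Arrow or Parquet code) of Dremel-style record shredding for a `list<list<int>>` column, under a deliberately simplified no-null encoding: definition level 0 means the outer list is empty, 1 means an inner list is empty, and 2 means a leaf value is present, while the repetition level marks the nesting depth at which repetition resumes. The read path has to invert exactly this kind of mapping, which is where much of the complexity lives.

```python
def shred(rows):
    """Flatten list<list<int>> rows into (value, rep_level, def_level) triples.

    Simplified no-null encoding: def 0 = empty outer list, 1 = empty inner
    list, 2 = value present; rep 0 = new record, 1 = new inner list within
    the record, 2 = another value within the current inner list.
    """
    out = []
    for row in rows:
        if not row:
            out.append((None, 0, 0))          # empty outer list
            continue
        for i, inner in enumerate(row):
            rep_outer = 0 if i == 0 else 1    # first inner list starts the record
            if not inner:
                out.append((None, rep_outer, 1))  # empty inner list
                continue
            for j, value in enumerate(inner):
                rep = rep_outer if j == 0 else 2
                out.append((value, rep, 2))
    return out

# Example: three records, including an empty outer and an empty inner list.
triples = shred([[[1, 2], [3]], [], [[], [4]]])
# → [(1, 0, 2), (2, 2, 2), (3, 1, 2), (None, 0, 0), (None, 0, 1), (4, 1, 2)]
```

The real Parquet encoding additionally threads nullability through the definition levels (each optional level adds one), which is part of why complex types are hard to vectorize.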
Since you are all much more familiar with the current code base, I could use some starting tips so I don't fall into common pitfalls.

[0] https://issues.apache.org/jira/browse/HIVE-18576
[1] https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45

On 2020/02/03 06:01:25, Micah Kornfield <e...@gmail.com> wrote:
> Just to give an update. I've been a little bit delayed, but my progress is
> as follows:
> 1. Had one PR merged that will exercise basic end-to-end tests.
> 2. Have another PR open that adds a configuration option in C++ to
> determine which algorithm version to use for reading/writing: the existing
> version or the new version supporting complex nested arrays. I think a
> large amount of code will be reused/delegated to, but I will err on the
> side of not touching the existing code/algorithms, so that any errors in
> the implementation or performance regressions can hopefully be mitigated
> at runtime. I expect that in later releases (once the code has "baked")
> the option will become a no-op.
> 3. Started coding the write path.
>
> Which leaves:
> 1. Finishing the write path (I estimate 2-3 weeks to be code complete).
> 2. Implementing the read path.
>
> Again, I'm happy to collaborate if people have bandwidth and want to
> contribute.
>
> Thanks,
> Micah
>
> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com> wrote:
>
> > Hi Wes,
> > I'm still interested in doing the work, but I don't want to hold
> > anybody up if they have bandwidth.
> >
> > In order to actually make progress on this, my plan is to:
> > 1. Help with the current Java review backlog through early next week
> > or so (this has been taking the majority of my time allocated for
> > Arrow contributions for the last six months or so).
> > 2. Shift all my attention to trying to get this done (this means no
> > reviews other than closing out existing ones that I've started, until
> > it is done). Hopefully, other Java committers can help shrink the
> > backlog further (Jacques, thanks for your recent efforts here).
> >
> > Thanks,
> > Micah
> >
> > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > > hi folks,
> > >
> > > I think we have reached a point where the incomplete C++ Parquet
> > > nested data assembly/disassembly is harming the value of several
> > > other parts of the project, for example the Datasets API. As another
> > > example, it's possible to ingest nested data from JSON but not, in
> > > general, to write it to Parquet.
> > >
> > > Implementing the nested data read and write paths completely is a
> > > difficult project requiring at least several weeks of dedicated
> > > work, so it's not so surprising that it hasn't been accomplished
> > > yet. I know that several people have expressed interest in working
> > > on it, but I would like to see if anyone would be able to volunteer
> > > a commitment of time and guess at a rough timeline for when this
> > > work could be done. It seems to me that if this slips beyond 2020 it
> > > will significantly diminish the value being created by other parts
> > > of the project.
> > >
> > > Since I'm pretty familiar with all the Parquet code, I'm one
> > > candidate to take on this project (and I can dedicate the time, but
> > > it would come at the expense of other projects where I can also be
> > > useful). But Micah and others have expressed interest in working on
> > > it, so I wanted to have a discussion to see what others think.
> > >
> > > Thanks
> > > Wes
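As a companion illustration of what the read path discussed above has to do, here is a toy Python sketch (again my own, not Arrow code) that reassembles `list<list<int>>` rows from (value, repetition level, definition level) triples, under a simplified no-null encoding: definition level 0 = empty outer list, 1 = empty inner list, 2 = value present; repetition level 0 starts a new record, 1 starts a new inner list, 2 appends to the current inner list.

```python
def assemble(triples):
    """Rebuild list<list<int>> rows from (value, rep_level, def_level) triples.

    Simplified no-null encoding: def 0 = empty outer list, 1 = empty inner
    list, 2 = value present; rep 0 starts a new record, rep <= 1 starts a
    new inner list, rep 2 appends to the current inner list.
    """
    rows = []
    for value, rep, defn in triples:
        if rep == 0:
            rows.append([])              # repetition level 0: new record
        if defn == 0:
            continue                     # outer list is empty
        if rep <= 1:
            rows[-1].append([])          # open a new inner list
        if defn == 2:
            rows[-1][-1].append(value)   # leaf value is present
    return rows

rows = assemble([(1, 0, 2), (2, 2, 2), (3, 1, 2),
                 (None, 0, 0), (None, 0, 1), (4, 1, 2)])
# → [[[1, 2], [3]], [], [[], [4]]]
```

Vectorizing this reassembly (rather than looping triple by triple) while also handling nullability at every level is, as far as I understand it, the core difficulty the thread is describing.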