hi Micah, I'm glad that we have the write side of nested completed for 0.17.0.
As far as completing the read side and then implementing sufficient testing to exercise corner cases in end-to-end reads/writes, do you anticipate being able to work on this in the next 4-6 weeks (obviously the state of the world has affected everyone's availability / bandwidth)? I ask because someone from my team (or I) may be able to get involved and help this move along. It'd be great to have this 100% completed and checked off our list for the next release (i.e. 0.18.0 or 1.0.0, depending on whether the Java/C++ integration tests also get completed).

Thanks,
Wes

On Wed, Feb 5, 2020 at 12:12 AM Micah Kornfield <[email protected]> wrote:
>>
>> Glad to hear about the progress. As I mentioned on #2, what do you
>> think about setting up a feature branch for you to merge PRs into?
>> Then the branch can be iterated on and we can merge it back when it's
>> feature complete and does not have perf regressions for the flat
>> read/write path.
>>
> I'd like to avoid a separate branch if possible. I'm willing to close the
> open PR till I'm sure it is needed, but I'm hoping that keeping PRs as small
> and focused as possible, with performance testing along the way, will be a
> better reviewer and developer experience here.
>
>> The earliest I'd have time to work on this myself would likely be
>> sometime in March. Others are welcome to jump in as well (and it'd be
>> great to increase the overall level of knowledge of the Parquet
>> codebase)
>
> Hopefully Igor can help out; otherwise I'll take up the read path after I
> finish the write path.
>
> -Micah
>
> On Tue, Feb 4, 2020 at 3:31 PM Wes McKinney <[email protected]> wrote:
>>
>> hi Micah
>>
>> On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield <[email protected]>
>> wrote:
>> >
>> > Just to give an update. I've been a little bit delayed, but my progress is
>> > as follows:
>> > 1. Had 1 PR merged that will exercise basic end-to-end tests.
>> > 2. Have another PR open that allows a configuration option in C++ to
>> > determine which algorithm version to use for reading/writing: the existing
>> > version or the new version supporting complex nested arrays. I think a
>> > large amount of code will be reused/delegated to, but I will err on the side
>> > of not touching the existing code/algorithms so that any errors in the
>> > implementation or performance regressions can hopefully be mitigated at
>> > runtime. I expect that in later releases (once the code has "baked") the
>> > option will become a no-op.
>>
>> Glad to hear about the progress. As I mentioned on #2, what do you
>> think about setting up a feature branch for you to merge PRs into?
>> Then the branch can be iterated on and we can merge it back when it's
>> feature complete and does not have perf regressions for the flat
>> read/write path.
>>
>> > 3. Started coding the write path.
>> >
>> > Which leaves:
>> > 1. Finishing the write path (I estimate 2-3 weeks) to be code complete
>> > 2. Implementing the read path.
>>
>> The earliest I'd have time to work on this myself would likely be
>> sometime in March. Others are welcome to jump in as well (and it'd be
>> great to increase the overall level of knowledge of the Parquet
>> codebase)
>>
>> > Again, I'm happy to collaborate if people have bandwidth and want to
>> > contribute.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <[email protected]>
>> > wrote:
>> >
>> > > Hi Wes,
>> > > I'm still interested in doing the work. But I don't want to hold anybody
>> > > up if they have bandwidth.
>> > >
>> > > In order to actually make progress on this, my plan will be to:
>> > > 1. Help with the current Java review backlog through early next week or
>> > > so (this has been taking the majority of my time allocated for Arrow
>> > > contributions for the last 6 months or so).
>> > > 2. Shift all my attention to trying to get this done (this means no
>> > > reviews other than closing out existing ones that I've started until it
>> > > is done). Hopefully, other Java committers can help shrink the backlog
>> > > further (Jacques, thanks for your recent efforts here).
>> > >
>> > > Thanks,
>> > > Micah
>> > >
>> > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <[email protected]> wrote:
>> > >
>> > >> hi folks,
>> > >>
>> > >> I think we have reached a point where the incomplete C++ Parquet
>> > >> nested data assembly/disassembly is harming the value of several
>> > >> other parts of the project, for example the Datasets API. As another
>> > >> example, it's possible to ingest nested data from JSON but not write
>> > >> it to Parquet in general.
>> > >>
>> > >> Implementing the nested data read and write path completely is a
>> > >> difficult project requiring at least several weeks of dedicated work,
>> > >> so it's not so surprising that it hasn't been accomplished yet. I know
>> > >> that several people have expressed interest in working on it, but I
>> > >> would like to see if anyone would be able to volunteer a commitment of
>> > >> time and guess on a rough timeline when this work could be done. It
>> > >> seems to me if this slips beyond 2020 it will significantly diminish the
>> > >> value being created by other parts of the project.
>> > >>
>> > >> Since I'm pretty familiar with all the Parquet code I'm one candidate
>> > >> person to take on this project (and I can dedicate the time, but it
>> > >> would come at the expense of other projects where I can also be
>> > >> useful). But Micah and others expressed interest in working on it, so
>> > >> I wanted to have a discussion about it to see what others think.
>> > >>
>> > >> Thanks
>> > >> Wes
>> > >>
>> > >
