We don't mind updating Iceberg when the API improves. We are fully aware that this is a very early implementation and that it will change. My hope is that the community is receptive to our suggestions.

A good example of an area with friction is filter and projection push-down. The implementation for DSv2 isn't based on what the other read paths do; it is brand new and mostly untested. I don't really understand why DSv2 introduced a new code path when reusing the existing code ended up being smaller and working for more cases (see my comments on #20476 <https://github.com/apache/spark/pull/20476>). I understand wanting to fix parts of push-down, just not why it is a good idea to mix that substantial change into an unrelated API update. This is one area where, I hope, our suggestion to get DSv2 working well and redesign push-down as a parallel effort is heard.

I also see a few areas where the integration of DSv2 conflicts with what I understand to be design principles of the Catalyst optimizer. The fact that it should use immutable nodes in plans is mostly settled, but there are other examples. The approach of the new push-down implementation fights against the principle of small rules that don't need to process the entire plan tree. I think this makes the component brittle, and I'd like to understand the rationale for going with this design. I'd love to see a design document that covers why this is a necessary choice (but again, separately).

rb

On Thu, Feb 1, 2018 at 9:10 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:

> +1 hangout
>
> ------------------------------
> *From:* Xiao Li <gatorsm...@gmail.com>
> *Sent:* Wednesday, January 31, 2018 10:46:26 PM
> *To:* Ryan Blue
> *Cc:* Reynold Xin; dev; Wenchen Fen; Russell Spitzer
> *Subject:* Re: data source v2 online meetup
>
> Hi, Ryan,
>
> wow, your Iceberg already used data source V2 API! That is pretty cool! I
> am just afraid these new APIs are not stable. We might deprecate or change
> some data source v2 APIs in the next version (2.4). Sorry for the
> inconvenience it might introduce.
>
> Thanks for your feedback always,
>
> Xiao
>
>
> 2018-01-31 15:54 GMT-08:00 Ryan Blue <rb...@netflix.com.invalid>:
>
>> Thanks for suggesting this, I think it's a great idea. I'll definitely
>> attend and can talk about the changes that we've made DataSourceV2 to
>> enable our new table format, Iceberg
>> <https://github.com/Netflix/iceberg#about-iceberg>.
>>
>> On Wed, Jan 31, 2018 at 2:35 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Data source v2 API is one of the larger main changes in Spark 2.3, and
>>> whatever that has already been committed is only the first version and we'd
>>> need more work post-2.3 to improve and stablize it.
>>>
>>> I think at this point we should stop making changes to it in branch-2.3,
>>> and instead focus on using the existing API and getting feedback for 2.4.
>>> Would people be interested in doing an online hangout to discuss this,
>>> perhaps in the month of Feb?
>>>
>>> It'd be more productive if people attending the hangout have tried the
>>> API by implementing some new sources or porting an existing source over.
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

--
Ryan Blue
Software Engineer
Netflix