+1

Regards,
Vaquar Khan
On Sep 10, 2017 5:18 AM, "Noman Khan" <nomanbp...@live.com> wrote:

> +1
> ------------------------------
> *From:* wangzhenhua (G) <wangzhen...@huawei.com>
> *Sent:* Friday, September 8, 2017 2:20:07 AM
> *To:* Dongjoon Hyun; 蒋星博
> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>
> +1 (non-binding) Great to see the data source API is going to be improved!
>
> best regards,
> -Zhenhua(Xander)
>
> *From:* Dongjoon Hyun [mailto:dongjoon.h...@gmail.com]
> *Sent:* September 8, 2017 4:07
> *To:* 蒋星博
> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>
> +1 (non-binding).
>
> On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 <jiangxb1...@gmail.com> wrote:
>
> +1
>
> Reynold Xin <r...@databricks.com> wrote on Thu, Sep 7, 2017 at 12:04 PM:
>
> +1 as well
>
> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
> +1
>
> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> +1 (non-binding)
>
> Thanks for making the updates reflected in the current PR. It would be great to see the doc updated before it is finally published, though.
>
> Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward the state of the art for performance.
>
> I think that's the right approach for this SPIP. We can add the support you're talking about later with a more specific plan that doesn't block fixing the problems this addresses.
>
> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <hvanhov...@databricks.com> wrote:
>
> +1 (binding)
>
> I personally believe there is quite a big difference between having a generic data source interface with a low surface area and pushing a significant part of query processing down into a datasource. The latter has a much wider surface area and would require us to stabilize most of the internal Catalyst APIs, which would be a significant maintenance burden on the community and has the potential to slow development velocity significantly. If you want to write such integrations, then you should be prepared to work with Catalyst internals and own up to the fact that things might change across minor versions (and in some cases even maintenance releases). If you are willing to go down that road, then your best bet is to use the already existing Spark session extensions, which allow you to write such integrations and can be used as an `escape hatch`.
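[For readers unfamiliar with the `escape hatch` Herman mentions: below is a minimal sketch of wiring a custom Catalyst rule in through the SparkSessionExtensions API (available since Spark 2.2). The rule name and its no-op body are hypothetical placeholders, not anything proposed in this thread.]

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Hypothetical optimizer rule; a real one would rewrite supported
    // subtrees into a source-specific scan. Note that Catalyst internals
    // like LogicalPlan carry no stability guarantee across minor releases,
    // which is exactly the trade-off Herman describes.
    case class MyPushdownRule(session: SparkSession) extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan // no-op placeholder
    }

    val spark = SparkSession.builder()
      .withExtensions { extensions =>
        extensions.injectOptimizerRule(session => MyPushdownRule(session))
      }
      .getOrCreate()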
> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <and...@andrewash.com> wrote:
>
> +0 (non-binding)
>
> I think there are benefits to unifying all the Spark-internal datasources into a common public API, for sure. It will serve as a forcing function to ensure that those internal datasources aren't advantaged over datasources developed externally as plugins to Spark, and that all Spark features are available to all datasources.
>
> But I also think this read-path proposal avoids the more difficult questions around how to keep pushing datasource performance forward. James Baker (my colleague) had a number of questions about advanced pushdowns (combined sorting and filtering), and Reynold also noted that pushdown of aggregates and joins is desirable on longer timeframes as well. The Spark community has seen similar requests: aggregate pushdown in SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown in SPARK-12449. Clearly a number of people are interested in this kind of performance work for datasources.
>
> To leave enough space for datasource developers to continue experimenting with advanced interactions between Spark and their datasources, I'd propose we leave some sort of escape valve that lets these datasources keep pushing the boundaries without forking Spark. Possibly that looks like an additional unsupported/unstable interface that pushes down an entire (unstable-API) logical plan, which is expected to break API on every release: Spark attempts this full-plan pushdown, and if it fails, Spark ignores it and continues with the rest of the V2 API for compatibility. Or maybe it looks like something else we don't know of yet. Possibly this falls outside the desired goals for the V2 API and should instead be a separate SPIP.
>
> If we had a plan for this kind of escape valve for advanced datasource developers, I'd be an unequivocal +1. Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward the state of the art for performance.
>
> Andrew
>
> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <suresh.thalam...@gmail.com> wrote:
>
> +1 (non-binding)
>
> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
> Hi all,
>
> In the previous discussion, we decided to split the read and write paths of data source v2 into 2 SPIPs, and I'm sending this email to call a vote for the Data Source V2 read path only.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>
> The ready-for-review PR that implements the basic infrastructure for the read path is:
> https://github.com/apache/spark/pull/19136
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical reasons.
>
> Thanks!
>
> --
> Herman van Hövell
> Software Engineer
> Databricks Inc.
> hvanhov...@databricks.com
> +31 6 420 590 27
> databricks.com
>
> --
> Ryan Blue
> Software Engineer
> Netflix
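[To make the escape-valve idea in Andrew's +0 above concrete: here is a rough sketch of what such an unsupported full-plan pushdown hook could look like. Every name below is invented for illustration; no interface like this exists in the SPIP or in Spark, and a real version would be explicitly unstable.]

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.types.StructType

    // Hypothetical, deliberately unstable interface: the source is offered
    // the entire logical plan and may claim it, or decline so that Spark
    // falls back to the ordinary V2 read path.
    trait ExperimentalFullPlanPushdown {
      // Some(scan) if the source can execute the whole subtree itself;
      // None to let Spark plan the query normally.
      def tryPushDown(plan: LogicalPlan): Option[FullPlanScan]
    }

    // Opaque handle for a fully pushed-down, source-side execution.
    trait FullPlanScan {
      def schema: StructType
    }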