This vote passes with 4 binding +1 votes, 10 non-binding +1 votes, one +0 vote, and no -1 votes.
Thanks all!

+1 votes (binding):
Wenchen Fan
Herman van Hövell tot Westerflier
Michael Armbrust
Reynold Xin

+1 votes (non-binding):
Xiao Li
Sameer Agarwal
Suresh Thalamati
Ryan Blue
Xingbo Jiang
Dongjoon Hyun
Zhenhua Wang
Noman Khan
vaquar khan
Hemant Bhanawat

+0 votes:
Andrew Ash

On Mon, Sep 11, 2017 at 4:03 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

> Yea, join pushdown (providing the other reader and the join conditions) and aggregate pushdown (providing the grouping keys and aggregate functions) can be added via the current framework in the future.
>
> On Mon, Sep 11, 2017 at 1:54 PM, Hemant Bhanawat <hemant9...@gmail.com> wrote:
>
>> +1 (non-binding)
>>
>> I have found the suggestion from Andrew Ash and James about plan push down quite interesting. However, I am not clear about the join push-down support at the data source level. Shouldn't it be the responsibility of the join node to carry out a data source specific join? I mean, the join node and the data source scans of the two sides can (theoretically) be coalesced into a single node. This can be done by providing a Strategy that replaces the join node with a data source specific join node. We are doing it that way for our data sources. I find this more intuitive.
>>
>> BTW, aggregate push-down support is desirable and should be considered as an enhancement going forward.
>>
>> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
>> www.snappydata.io
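For illustration, a minimal sketch (against the Spark 2.2-era planner API) of the Strategy-based coalescing Hemant describes. MySourceRelation (the source's logical scan node) and MySourceJoinExec (a physical operator that runs the join inside the source) are hypothetical placeholders whose definitions are elided; Strategy, Join, LogicalPlan, and SparkPlan are real Spark APIs.

    import org.apache.spark.sql.Strategy
    import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
    import org.apache.spark.sql.execution.SparkPlan

    // Hypothetical: MySourceRelation is the source's logical scan node,
    // MySourceJoinExec a physical operator that executes the join inside
    // the data source itself. Both definitions are elided in this sketch.
    object DataSourceJoinStrategy extends Strategy {
      def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
        // Both join children scan the same external source: collapse the
        // join and the two scans into one source-specific physical node.
        case Join(left: MySourceRelation, right: MySourceRelation, joinType, cond) =>
          MySourceJoinExec(left, right, joinType, cond) :: Nil
        // Otherwise return Nil, deferring to Spark's built-in strategies.
        case _ => Nil
      }
    }

Returning Nil from the default case keeps the strategy purely additive: Spark only uses the source-specific join when the pattern matches, and plans everything else as usual.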
>> On Sun, Sep 10, 2017 at 8:45 PM, vaquar khan <vaquar.k...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Sep 10, 2017 5:18 AM, "Noman Khan" <nomanbp...@live.com> wrote:
>>>
>>>> +1
>>>>
>>>> *From:* wangzhenhua (G) <wangzhen...@huawei.com>
>>>> *Sent:* Friday, September 8, 2017 2:20:07 AM
>>>> *To:* Dongjoon Hyun; 蒋星博
>>>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
>>>> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>>>>
>>>> +1 (non-binding) Great to see the data source API is going to be improved!
>>>>
>>>> best regards,
>>>> -Zhenhua (Xander)
>>>>
>>>> *From:* Dongjoon Hyun [mailto:dongjoon.h...@gmail.com]
>>>> *Sent:* September 8, 2017, 4:07
>>>> *To:* 蒋星博
>>>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
>>>> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>>>>
>>>> +1 (non-binding).
>>>>
>>>> On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 <jiangxb1...@gmail.com> wrote:
>>>>
>>>> +1
>>>>
>>>> On Thu, Sep 7, 2017 at 12:04 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>> +1 as well
>>>>
>>>> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>
>>>> +1
>>>>
>>>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>
>>>> +1 (non-binding)
>>>>
>>>> Thanks for making the updates reflected in the current PR. It would be great to see the doc updated before it is finally published, though.
>>>>
>>>> Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward the state of the art for performance. I think that's the right approach for this SPIP. We can add the support you're talking about later with a more specific plan that doesn't block fixing the problems that this addresses.
>>>>
>>>> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <hvanhov...@databricks.com> wrote:
>>>>
>>>> +1 (binding)
>>>>
>>>> I personally believe that there is quite a big difference between having a generic data source interface with a low surface area and pushing down a significant part of query processing into a datasource. The latter has a much wider surface area and would require us to stabilize most of the internal Catalyst APIs, which would be a significant maintenance burden on the community and has the potential to slow development velocity significantly. If you want to write such integrations, you should be prepared to work with Catalyst internals and own up to the fact that things might change across minor versions (and in some cases even maintenance releases). If you are willing to go down that road, your best bet is to use the already existing Spark session extensions, which allow you to write such integrations and can be used as an "escape hatch".
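The session extensions Herman mentions are a concrete API as of Spark 2.2 (SparkSessionExtensions). A minimal sketch of using that escape hatch to wire in a custom planner strategy, reusing the hypothetical DataSourceJoinStrategy from the earlier sketch:

    import org.apache.spark.sql.SparkSession

    // Builder.withExtensions and injectPlannerStrategy are real Spark 2.2+
    // APIs; DataSourceJoinStrategy is the hypothetical strategy sketched
    // earlier in this thread.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("extensions-escape-hatch")
      .withExtensions { extensions =>
        extensions.injectPlannerStrategy(session => DataSourceJoinStrategy)
      }
      .getOrCreate()

The same hook can inject parser, analyzer, and optimizer rules as well, which is what makes it viable as an escape hatch for integrations that have to reach into Catalyst internals.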
>>>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>>
>>>> +0 (non-binding)
>>>>
>>>> I think there are benefits to unifying all the Spark-internal datasources into a common public API for sure. It will serve as a forcing function to ensure that those internal datasources aren't advantaged vs datasources developed externally as plugins to Spark, and that all Spark features are available to all datasources.
>>>>
>>>> But I also think this read-path proposal avoids the more difficult questions around how to continue pushing datasource performance forward. James Baker (my colleague) had a number of questions about advanced pushdowns (combined sorting and filtering), and Reynold also noted that pushdown of aggregates and joins is desirable on longer timeframes as well. The Spark community has seen similar requests: aggregate pushdown in SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown in SPARK-12449. Clearly a number of people are interested in this kind of performance work for datasources.
>>>>
>>>> To leave enough space for datasource developers to continue experimenting with advanced interactions between Spark and their datasources, I'd propose we leave some sort of escape valve that enables these datasources to keep pushing the boundaries without forking Spark. Possibly that looks like an additional unsupported/unstable interface that pushes down an entire (unstable-API) logical plan, which is expected to break API on every release. (Spark attempts this full-plan pushdown, and if that fails, Spark ignores it and continues on with the rest of the V2 API for compatibility.) Or maybe it looks like something else that we don't know of yet. Possibly this falls outside of the desired goals for the V2 API and instead should be a separate SPIP.
>>>>
>>>> If we had a plan for this kind of escape valve for advanced datasource developers I'd be an unequivocal +1. Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward the state of the art for performance.
>>>>
>>>> Andrew
>>>>
>>>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <suresh.thalam...@gmail.com> wrote:
>>>>
>>>> +1 (non-binding)
>>>>
>>>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> In the previous discussion, we decided to split the read and write paths of Data Source V2 into 2 SPIPs, and I'm sending this email to call a vote for the Data Source V2 read path only.
>>>>
>>>> The full document of the Data Source API V2 is:
>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>
>>>> The ready-for-review PR that implements the basic infrastructure for the read path is:
>>>> https://github.com/apache/spark/pull/19136
>>>>
>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>
>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>> +0: Don't really care.
>>>> -1: I don't think this is a good idea because of the following technical reasons.
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Herman van Hövell
>>>> Software Engineer
>>>> Databricks Inc.
>>>> hvanhov...@databricks.com
>>>> +31 6 420 590 27
>>>> databricks.com
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
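For context on the scope of the vote, a rough sketch of the shape of the proposed read-path API, paraphrased into Scala from the design doc above. The interface and method names here are illustrative, not authoritative; the linked doc and PR define the actual API.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.StructType

    // Driver side: one reader per scan; optional mixin interfaces (e.g.
    // column pruning, filter pushdown) refine the schema and row set.
    trait DataSourceV2Reader {
      def readSchema(): StructType // schema after any pushdown is applied
      def createReadTasks(): java.util.List[ReadTask[Row]]
    }

    // One task per partition/split, serialized and shipped to executors.
    trait ReadTask[T] extends java.io.Serializable {
      def createDataReader(): DataReader[T]
    }

    // Executor side: iterator-style access to the task's rows.
    trait DataReader[T] extends java.io.Closeable {
      def next(): Boolean
      def get(): T
    }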