This vote passes with 4 binding +1 votes, 10 non-binding +1 votes, one +0 vote, and no -1 votes.
Thanks all!

+1 votes (binding):
Wenchen Fan
Herman van Hövell tot Westerflier
Michael Armbrust
Reynold Xin

+1 votes (non-binding):
Xiao Li
Sameer Agarwal
Suresh Thalamati
Ryan Blue
Xingbo Jiang
Dongjoon Hyun
Zhenhua Wang
Noman Khan
vaquar khan
Hemant Bhanawat

+0 votes:
Andrew Ash

On Mon, Sep 11, 2017 at 4:03 PM, Wenchen Fan <cloud0...@gmail.com> wrote:

> Yea, join pushdown (providing the other reader and the join conditions) and aggregate pushdown (providing the grouping keys and aggregate functions) can be added via the current framework in the future.
>
> On Mon, Sep 11, 2017 at 1:54 PM, Hemant Bhanawat <hemant9...@gmail.com> wrote:
>
>> +1 (non-binding)
>>
>> I have found the suggestion from Andrew Ash and James about plan push down quite interesting. However, I am not clear about the join push-down support at the data source level. Shouldn't it be the responsibility of the join node to carry out a data source specific join? I mean, the join node and the data source scans of the two sides can (theoretically) be coalesced into a single node. This can be done by providing a Strategy that replaces the join node with a data source specific join node. We are doing it that way for our data sources. I find this more intuitive.
>>
>> BTW, aggregate push-down support is desirable and should be considered as an enhancement going forward.
>>
>> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
>> www.snappydata.io
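For illustration, a minimal sketch (against the Spark 2.2-era planner API) of the Strategy-based coalescing Hemant describes. MySourceRelation (the source's logical scan node) and MySourceJoinExec (a physical operator that runs the join inside the source) are hypothetical placeholders whose definitions are elided; Strategy, Join, LogicalPlan, and SparkPlan are real Spark APIs.

    import org.apache.spark.sql.Strategy
    import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
    import org.apache.spark.sql.execution.SparkPlan

    // Hypothetical: MySourceRelation is the source's logical scan node,
    // MySourceJoinExec a physical operator that executes the join inside
    // the data source itself. Both definitions are elided in this sketch.
    object DataSourceJoinStrategy extends Strategy {
      def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
        // Both join children scan the same external source: collapse the
        // join and the two scans into one source-specific physical node.
        case Join(left: MySourceRelation, right: MySourceRelation, joinType, cond) =>
          MySourceJoinExec(left, right, joinType, cond) :: Nil
        // Otherwise return Nil, deferring to Spark's built-in strategies.
        case _ => Nil
      }
    }

Returning Nil from the default case keeps the strategy purely additive: Spark only uses the source-specific join when the pattern matches, and plans everything else as usual.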
>> On Sun, Sep 10, 2017 at 8:45 PM, vaquar khan <vaquar.k...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Sep 10, 2017 5:18 AM, "Noman Khan" <nomanbp...@live.com> wrote:
>>>
>>>> +1
>>>>
>>>> *From:* wangzhenhua (G) <wangzhen...@huawei.com>
>>>> *Sent:* Friday, September 8, 2017 2:20:07 AM
>>>> *To:* Dongjoon Hyun; 蒋星博
>>>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
>>>> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>>>>
>>>> +1 (non-binding) Great to see the data source API is going to be improved!
>>>>
>>>> best regards,
>>>> -Zhenhua (Xander)
>>>>
>>>> *From:* Dongjoon Hyun [mailto:dongjoon.h...@gmail.com]
>>>> *Sent:* September 8, 2017, 4:07
>>>> *To:* 蒋星博
>>>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
>>>> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>>>>
>>>> +1 (non-binding).
>>>>
>>>> On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 <jiangxb1...@gmail.com> wrote:
>>>>
>>>> +1
>>>>
>>>> On Thu, Sep 7, 2017 at 12:04 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>> +1 as well
>>>>
>>>> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>
>>>> +1
>>>>
>>>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>
>>>> +1 (non-binding)
>>>>
>>>> Thanks for making the updates reflected in the current PR. It would be great to see the doc updated before it is finally published, though.
>>>>
>>>> Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward the state of the art for performance. I think that's the right approach for this SPIP. We can add the support you're talking about later with a more specific plan that doesn't block fixing the problems that this addresses.
>>>>
>>>> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <hvanhov...@databricks.com> wrote:
>>>>
>>>> +1 (binding)
>>>>
>>>> I personally believe that there is quite a big difference between having a generic data source interface with a low surface area and pushing down a significant part of query processing into a datasource. The latter has a much wider surface area and would require us to stabilize most of the internal Catalyst APIs, which would be a significant maintenance burden on the community and has the potential to slow development velocity significantly. If you want to write such integrations, you should be prepared to work with Catalyst internals and own up to the fact that things might change across minor versions (and in some cases even maintenance releases). If you are willing to go down that road, your best bet is to use the already existing Spark session extensions, which allow you to write such integrations and can be used as an "escape hatch".
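The session extensions Herman mentions are a concrete API as of Spark 2.2 (SparkSessionExtensions). A minimal sketch of using that escape hatch to wire in a custom planner strategy, reusing the hypothetical DataSourceJoinStrategy from the earlier sketch:

    import org.apache.spark.sql.SparkSession

    // Builder.withExtensions and injectPlannerStrategy are real Spark 2.2+
    // APIs; DataSourceJoinStrategy is the hypothetical strategy sketched
    // earlier in this thread.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("extensions-escape-hatch")
      .withExtensions { extensions =>
        extensions.injectPlannerStrategy(session => DataSourceJoinStrategy)
      }
      .getOrCreate()

The same hook can inject parser, analyzer, and optimizer rules as well, which is what makes it viable as an escape hatch for integrations that have to reach into Catalyst internals.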
>>>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>>
>>>> +0 (non-binding)
>>>>
>>>> I think there are benefits to unifying all the Spark-internal datasources into a common public API for sure. It will serve as a forcing function to ensure that those internal datasources aren't advantaged vs datasources developed externally as plugins to Spark, and that all Spark features are available to all datasources.
>>>>
>>>> But I also think this read-path proposal avoids the more difficult questions around how to continue pushing datasource performance forward. James Baker (my colleague) had a number of questions about advanced pushdowns (combined sorting and filtering), and Reynold also noted that pushdown of aggregates and joins is desirable on longer timeframes as well. The Spark community has seen similar requests: aggregate pushdown in SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown in SPARK-12449. Clearly a number of people are interested in this kind of performance work for datasources.
>>>>
>>>> To leave enough space for datasource developers to continue experimenting with advanced interactions between Spark and their datasources, I'd propose we leave some sort of escape valve that enables these datasources to keep pushing the boundaries without forking Spark. Possibly that looks like an additional unsupported/unstable interface that pushes down an entire (unstable-API) logical plan, which is expected to break API on every release. (Spark attempts this full-plan pushdown, and if that fails, Spark ignores it and continues on with the rest of the V2 API for compatibility.) Or maybe it looks like something else that we don't know of yet. Possibly this falls outside of the desired goals for the V2 API and instead should be a separate SPIP.
>>>>
>>>> If we had a plan for this kind of escape valve for advanced datasource developers I'd be an unequivocal +1. Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward the state of the art for performance.
>>>>
>>>> Andrew
>>>>
>>>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <suresh.thalam...@gmail.com> wrote:
>>>>
>>>> +1 (non-binding)
>>>>
>>>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> In the previous discussion, we decided to split the read and write paths of Data Source V2 into 2 SPIPs, and I'm sending this email to call a vote for the Data Source V2 read path only.
>>>>
>>>> The full document of the Data Source API V2 is:
>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>
>>>> The ready-for-review PR that implements the basic infrastructure for the read path is:
>>>> https://github.com/apache/spark/pull/19136
>>>>
>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>
>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>> +0: Don't really care.
>>>> -1: I don't think this is a good idea because of the following technical reasons.
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Herman van Hövell
>>>> Software Engineer
>>>> Databricks Inc.
>>>> hvanhov...@databricks.com
>>>> +31 6 420 590 27
>>>> databricks.com
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
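For context on the scope of the vote, a rough sketch of the shape of the proposed read-path API, paraphrased into Scala from the design doc above. The interface and method names here are illustrative, not authoritative; the linked doc and PR define the actual API.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.StructType

    // Driver side: one reader per scan; optional mixin interfaces (e.g.
    // column pruning, filter pushdown) refine the schema and row set.
    trait DataSourceV2Reader {
      def readSchema(): StructType // schema after any pushdown is applied
      def createReadTasks(): java.util.List[ReadTask[Row]]
    }

    // One task per partition/split, serialized and shipped to executors.
    trait ReadTask[T] extends java.io.Serializable {
      def createDataReader(): DataReader[T]
    }

    // Executor side: iterator-style access to the task's rows.
    trait DataReader[T] extends java.io.Closeable {
      def next(): Boolean
      def get(): T
    }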