>>> Vaquar Khan
>>>
>>> On Sep 10, 2017 5:18 AM, "Noman Khan" wrote:
>>>
>>>> +1
>>>> --
>>>> *From:* wangzhenhua (G)
>>>> *Sent:* Friday, September 8, 2017 2:20:07 AM
>>>> *To:* Dongjoon Hyun; 蒋星博
>>>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot
>>>> Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
>>>> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>>>>
>>>> +1 (non-binding) Great to see data source API is going to be improved!
>>>>
>>>> best regards,
>>>> -Zh
+1 (non-binding).

On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 wrote:

> +1
>
> Reynold Xin wrote on Thu, Sep 7, 2017 at 12:04 PM:
>
>> +1 as well
>>
>> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust wrote:
>>
>>> +1
>>>
>>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue wrote:
>>>
>>>> +1 (non-binding)
+1

Reynold Xin wrote on Thu, Sep 7, 2017 at 12:04 PM:

> +1 as well
>
> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust wrote:
>
>> +1
>>
>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thanks for making the updates reflected in the current PR. It would be
>>> great to see the doc updated before it is finally published though.
+1 as well
On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust
wrote:
> +1
>
> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue
> wrote:
>
>> +1 (non-binding)
>>
>> Thanks for making the updates reflected in the current PR. It would be
>> great to see the doc updated before it is finally published though.
+1
On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue wrote:
> +1 (non-binding)
>
> Thanks for making the updates reflected in the current PR. It would be
> great to see the doc updated before it is finally published though.
>
> Right now it feels like this SPIP is focused more on getting the basics
> right for what many datasources are already doing in API V1 combined
+1 (non-binding)
Thanks for making the updates reflected in the current PR. It would be
great to see the doc updated before it is finally published though.
Right now it feels like this SPIP is focused more on getting the basics
right for what many datasources are already doing in API V1 combined
+1 (binding)
I personally believe that there is quite a big difference between having a
generic data source interface with a low surface area and pushing down a
significant part of query processing into a datasource. The latter has a much
wider surface area and will require us to stabilize most
+0 (non-binding)
I think there are benefits to unifying all the Spark-internal datasources
into a common public API for sure. It will serve as a forcing function to
ensure that those internal datasources aren't advantaged vs datasources
developed externally as plugins to Spark, and that all Spark
+1 (non-binding)
> On Sep 6, 2017, at 7:29 PM, Wenchen Fan wrote:
>
> Hi all,
>
> In the previous discussion, we decided to split the read and write path of
> data source v2 into 2 SPIPs, and I'm sending this email to call a vote for
> Data Source V2 read path only.
>
> The full document of
+1
On Wed, Sep 6, 2017 at 8:53 PM, Xiao Li wrote:
> +1
>
> Xiao
>
> 2017-09-06 19:37 GMT-07:00 Wenchen Fan :
>
>> adding my own +1 (binding)
>>
>> On Thu, Sep 7, 2017 at 10:29 AM, Wenchen Fan wrote:
>>
>>> Hi all,
>>>
>>> In the previous discussion, we decided to split the read and write path
>
+1
Xiao
2017-09-06 19:37 GMT-07:00 Wenchen Fan :
> adding my own +1 (binding)
>
> On Thu, Sep 7, 2017 at 10:29 AM, Wenchen Fan wrote:
>
>> Hi all,
>>
>> In the previous discussion, we decided to split the read and write path
>> of data source v2 into 2 SPIPs, and I'm sending this email to call
adding my own +1 (binding)
On Thu, Sep 7, 2017 at 10:29 AM, Wenchen Fan wrote:
> Hi all,
>
> In the previous discussion, we decided to split the read and write path of
> data source v2 into 2 SPIPs, and I'm sending this email to call a vote for
> Data Source V2 read path only.
>
> The full docum
Hi Ryan,
Yea I agree with you that we should discuss some substantial details during
the vote, and I addressed your comments about schema inference API in my
new PR, please take a look.
I've also called a new vote for the read path, please vote there, thanks!
On Thu, Sep 7, 2017 at 7:55 AM, Ryan
Hi all,
In the previous discussion, we decided to split the read and write path of
data source v2 into 2 SPIPs, and I'm sending this email to call a vote for
Data Source V2 read path only.
The full document of the Data Source API V2 is:
https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ
I'm all for keeping this moving and not getting too far into the details
(like naming), but I think the substantial details should be clarified
first since they are in the proposal that's being voted on.
I would prefer moving the write side to a separate SPIP, too, since there
isn't much detail in
Hi all,
I've submitted a PR for a basic data source v2, i.e., only contains
features we already have in data source v1. We can discuss API details like
naming in that PR: https://github.com/apache/spark/pull/19136
In the meantime, let's keep this vote open to collect more feedback.
Thanks
Why does ordering matter here for sort vs filter? The source should be able
to handle it in whatever way it wants (which is almost always filter
beneath sort I'd imagine).
The only ordering that'd matter in the current set of pushdowns is limit -
it should always mean the root of the pushed-down tree.
> Ideally also getting sort orders _after_ getting filters.
Yea we should have a deterministic order when applying various push downs,
and I think filter should definitely go before sort. This is one of the
details we can discuss during PR review :)
On Fri, Sep 1, 2017 at 9:19 AM, James Baker wr
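For readers skimming the thread, the ordering being debated can be sketched in code. This is a toy illustration only: the interface and method names below are invented for the sketch and are not the API in the prototype PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mixin-style capabilities a reader might implement
// (illustrative names only, not the prototype's interfaces).
interface SupportsFilterPushdown { boolean pushFilter(String predicate); }
interface SupportsSortPushdown   { boolean pushSort(String order); }

public class Main {
    // A reader that accepts every pushdown, for demonstration.
    static class AcceptAllReader implements SupportsFilterPushdown, SupportsSortPushdown {
        public boolean pushFilter(String p) { return true; }
        public boolean pushSort(String o)   { return true; }
    }

    // Offers pushdowns to the source in a fixed, deterministic order:
    // filters first, then sort, with limit always at the root.
    static List<String> applyPushdowns(Object reader, String filter, String sort, int limit) {
        List<String> applied = new ArrayList<>();
        // 1. Filters first: they shrink the data any sort would have to order.
        if (reader instanceof SupportsFilterPushdown
                && ((SupportsFilterPushdown) reader).pushFilter(filter)) {
            applied.add("filter:" + filter);
        }
        // 2. Sort next, beneath any limit.
        if (reader instanceof SupportsSortPushdown
                && ((SupportsSortPushdown) reader).pushSort(sort)) {
            applied.add("sort:" + sort);
        }
        // 3. Limit always sits at the root of the pushed-down tree.
        applied.add("limit:" + limit);
        return applied;
    }

    public static void main(String[] args) {
        System.out.println(applyPushdowns(new AcceptAllReader(), "x > 5", "y ASC", 10));
    }
}
```

A source that cannot satisfy a given pushdown simply returns false from the capability call and Spark keeps that operator in its own plan.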
I think that makes sense. I didn't understand backcompat was the primary
driver. I actually don't care right now about aggregations on the datasource
I'm integrating with - I just care about receiving all the filters (and ideally
also the desired sort order) at the same time. I am mostly fine wi
Hi Ryan,
I think for a SPIP, we should not worry too much about details, as we can
discuss them during PR review after the vote pass.
I think we should focus more on the overall design, like James did. The
interface mix-in vs plan push down discussion was great, hope we can get a
consensus on thi
Maybe I'm missing something, but the high-level proposal consists of:
Goals, Non-Goals, and Proposed API. What is there to discuss other than the
details of the API that's being proposed? I think the goals make sense, but
goals alone aren't enough to approve a SPIP.
On Wed, Aug 30, 2017 at 2:46 PM
I guess I was more suggesting that by coding up the powerful mode as the API,
it becomes easy for someone to layer an easy mode beneath it to enable simpler
datasources to be integrated (and that simple mode should be the out of scope
thing).
Taking a small step back here, one of the places whe
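One possible reading of "layer an easy mode beneath it", sketched with invented names (neither interface is from the prototype PR): the plan-accepting API is the contract, and an adapter lets a simple source implement only a per-filter callback.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// The "powerful mode": the source sees every pushed node at once and
// returns whatever it cannot handle (simplified to strings for the sketch).
interface PowerfulSource {
    List<String> pushAll(List<String> nodes);
}

// The "easy mode": a simple source only answers one filter at a time.
interface EasySource {
    boolean acceptFilter(String filter);
}

// Adapter layering easy mode beneath the powerful API.
class EasyAdapter implements PowerfulSource {
    private final EasySource simple;
    EasyAdapter(EasySource simple) { this.simple = simple; }

    public List<String> pushAll(List<String> nodes) {
        List<String> rejected = new LinkedList<>();
        for (String node : nodes) {
            // Anything the simple source declines stays in Spark's plan.
            if (!simple.acceptFilter(node)) rejected.add(node);
        }
        return rejected;
    }
}

public class Main {
    public static void main(String[] args) {
        // A simple source that only understands equality filters.
        EasySource eq = f -> f.contains("=");
        PowerfulSource source = new EasyAdapter(eq);
        System.out.println(source.pushAll(Arrays.asList("x = 1", "y > 2")));
    }
}
```

The point of the sketch is that the simple mode can live outside the core API surface, so only the powerful interface needs stabilizing.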
Sure that's good to do (and as discussed earlier a good compromise might be
to expose an interface for the source to decide which part of the logical
plan they want to accept).
To me everything is about cost vs benefit.
In my mind, the biggest issue with the existing data source API is backward
a
So we seem to be getting into a cycle of discussing more about the details
of APIs than the high level proposal. The details of APIs are important to
debate, but those belong more in code reviews.
One other important thing is that we should avoid API design by committee.
While it is extremely usef
-1 (non-binding)
Sometimes it takes a VOTE thread to get people to actually read and
comment, so thanks for starting this one… but there’s still discussion
happening on the prototype API, which hasn't been updated yet. I'd like to
see the proposal shaped by the ongoing discussion so that we have a
That might be good to do, but seems like orthogonal to this effort itself.
It would be a completely different interface.
On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan wrote:
> OK I agree with it, how about we add a new interface to push down the
> query plan, based on the current framework? We can
OK I agree with it, how about we add a new interface to push down the query
plan, based on the current framework? We can mark the query-plan-push-down
interface as unstable, to save the effort of designing a stable
representation of query plan and maintaining forward compatibility.
On Wed, Aug 30,
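The "mark the query-plan-push-down interface as unstable" idea can be sketched as follows. This is a hypothetical illustration: a real version would receive Catalyst plan nodes rather than strings, and none of these names come from the actual PR.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Sketch of an unstable plan-pushdown hook: the source is handed a
// (simplified) logical plan and returns the remainder Spark must still
// evaluate itself. Strings stand in for plan nodes to keep the sketch small.
interface UnstablePlanPushdown {
    List<String> pushPlan(List<String> planNodes); // returns unsupported remainder
}

public class Main {
    // A toy source that can evaluate filters and projections, nothing else.
    static class FilterProjectSource implements UnstablePlanPushdown {
        public List<String> pushPlan(List<String> planNodes) {
            List<String> remainder = new LinkedList<>();
            for (String node : planNodes) {
                if (!node.startsWith("Filter") && !node.startsWith("Project")) {
                    remainder.add(node); // e.g. an Aggregate stays in Spark
                }
            }
            return remainder;
        }
    }

    public static void main(String[] args) {
        List<String> plan = Arrays.asList("Aggregate", "Filter(x > 5)", "Project(x)");
        System.out.println(new FilterProjectSource().pushPlan(plan)); // [Aggregate]
    }
}
```

Marking such an interface unstable defers the hard problem the thread raises: committing to a stable, forward-compatible representation of the plan itself.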
I'll just focus on the one-by-one thing for now - it's the thing that blocks me
the most.
I think the place where we're most confused here is on the cost of determining
whether I can push down a filter. For me, in order to work out whether I can
push down a filter or satisfy a sort, I might hav
Hi James,
Thanks for your feedback! I think your concerns are all valid, but we need
to make a tradeoff here.
> Explicitly here, what I'm looking for is a convenient mechanism to accept
a fully specified set of arguments
The problem with this approach is: 1) if we wanna add more arguments in the
Yeah, for sure.
With the stable representation - agree that in the general case this is pretty
intractable, it restricts the modifications that you can do in the future too
much. That said, it shouldn't be as hard if you restrict yourself to the parts
of the plan which are supported by the data
James,
Thanks for the comment. I think you just pointed out a trade-off between
expressiveness and API simplicity, compatibility and evolvability. For the
max expressiveness, we'd want the ability to expose full query plans, and
let the data source decide which part of the query plan can be pushed
Copying from the code review comments I just submitted on the draft API
(https://github.com/cloud-fan/spark/pull/10#pullrequestreview-59088745):
Context here is that I've spent some time implementing a Spark datasource and
have had some issues with the current API which are made worse in V2.
Th
+1 (Non-binding)
Xiao Li wrote on Mon, Aug 28, 2017 at 5:38 PM:
> +1
>
> 2017-08-28 12:45 GMT-07:00 Cody Koeninger :
>
>> Just wanted to point out that because the jira isn't labeled SPIP, it
>> won't have shown up linked from
>>
>> http://spark.apache.org/improvement-proposals.html
>>
>> On Mon, Aug 28, 2017 a
+1
2017-08-28 12:45 GMT-07:00 Cody Koeninger :
> Just wanted to point out that because the jira isn't labeled SPIP, it
> won't have shown up linked from
>
> http://spark.apache.org/improvement-proposals.html
>
> On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan wrote:
> > Hi all,
> >
> > It has been
Just wanted to point out that because the jira isn't labeled SPIP, it
won't have shown up linked from
http://spark.apache.org/improvement-proposals.html
On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan wrote:
> Hi all,
>
> It has been almost 2 weeks since I proposed the data source V2 for
> discussi
+1 (Non-binding)
The clustering approach covers most of my requirements on saving some
shuffles. We kind of left the "should the user be allowed to provide a full
partitioner" discussion on the table. I understand that would require
exposing a lot of internals so this is perhaps a good compromise.
Hi all,
It has been almost 2 weeks since I proposed the data source V2 for
discussion, and we already got some feedbacks on the JIRA ticket and the
prototype PR, so I'd like to call for a vote.
The full document of the Data Source API V2 is:
https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEo
Points 2, 3 and 4 of the Project Plan in that document (i.e. "port existing
data sources using internal APIs to use the proposed public Data Source V2
API") have my full support. Really, I'd like to see that dog-fooding effort
completed and the lessons learned from it fully digested before we remove any
Yea I don't think it's a good idea to upload a doc and then call for a vote
immediately. People need time to digest ...
On Thu, Aug 17, 2017 at 6:22 AM, Wenchen Fan wrote:
> Sorry let's remove the VOTE tag as I just wanna bring this up for
> discussion.
>
> I'll restart the voting process after
Sorry let's remove the VOTE tag as I just wanna bring this up for
discussion.
I'll restart the voting process after we have enough discussion on the JIRA
ticket or here in this email thread.
On Thu, Aug 17, 2017 at 9:12 PM, Russell Spitzer
wrote:
> -1, I don't think there has really been any di
-1, I don't think there has really been any discussion of this api change
yet or at least it hasn't occurred on the jira ticket
On Thu, Aug 17, 2017 at 8:05 AM Wenchen Fan wrote:
> adding my own +1 (binding)
>
> On Thu, Aug 17, 2017 at 9:02 PM, Wenchen Fan wrote:
>
>> Hi all,
>>
>> Following th
+1 (non-binding)
Wenchen Fan wrote on Thu, Aug 17, 2017 at 9:05 PM:
> adding my own +1 (binding)
>
> On Thu, Aug 17, 2017 at 9:02 PM, Wenchen Fan wrote:
>
>> Hi all,
>>
>> Following the SPIP process, I'm putting this SPIP up for a vote.
>>
>> The current data source API doesn't work well because of some limita
adding my own +1 (binding)
On Thu, Aug 17, 2017 at 9:02 PM, Wenchen Fan wrote:
> Hi all,
>
> Following the SPIP process, I'm putting this SPIP up for a vote.
>
> The current data source API doesn't work well because of some limitations
> like: no partitioning/bucketing support, no columnar read,
Hi all,
Following the SPIP process, I'm putting this SPIP up for a vote.
The current data source API doesn't work well because of some limitations
like: no partitioning/bucketing support, no columnar read, hard to support
more operator push down, etc.
I'm proposing a Data Source API V2 to addres
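To make the "no columnar read" limitation concrete, here is a toy sketch of what a columnar mixin might look like next to V1-style row reading. All names are hypothetical, not the proposed API.

```java
import java.util.Arrays;
import java.util.List;

// Row-at-a-time reading, roughly the shape data source V1 offers.
interface RowReader {
    List<int[]> readRows();
}

// A hypothetical columnar mixin: the engine asks for whole column vectors
// and can scan them without per-row virtual calls.
interface SupportsColumnarRead {
    int[] readColumn(String name);
}

public class Main {
    // A source exposing the same two-column data both ways.
    static class ToySource implements RowReader, SupportsColumnarRead {
        public List<int[]> readRows() {
            return Arrays.asList(new int[]{1, 10}, new int[]{2, 20});
        }
        public int[] readColumn(String name) {
            return name.equals("a") ? new int[]{1, 2} : new int[]{10, 20};
        }
    }

    public static void main(String[] args) {
        ToySource src = new ToySource();
        // The engine prefers the columnar path when the mixin is present.
        int[] col = (src instanceof SupportsColumnarRead)
                ? src.readColumn("b") : null;
        System.out.println(Arrays.toString(col)); // [10, 20]
    }
}
```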