James,

Thanks for the comment. I think you just pointed out a trade-off between
expressiveness on one side and API simplicity, compatibility, and
evolvability on the other. For maximum expressiveness, we'd want the
ability to expose full query plans and let the data source decide which
parts of the query plan can be pushed down.

The downsides of that (full query plan pushdown) are:

1. It is extremely difficult to design a stable representation for logical
/ physical plans. It is doable, but we'd be the first to do it. I'm not
aware of any mainstream database having done that in the past. The design
of that API itself, to make sure we have a good story for backward and
forward compatibility, would probably take months if not years. It might
still be good to do, or to offer an experimental trait without
compatibility guarantees that uses the current Catalyst internal logical
plan.

2. Most data source developers simply want a way to offer some data,
without any pushdown. Having to understand query plans is a burden rather
than a gift.


Re: your point about the proposed v2 being worse than v1 for your use case.

Can you say more? You used the argument that in v2 there is more support
for broader pushdown and as a result it is harder to implement. That's how
it is supposed to be. If a data source simply implements one of the traits,
it'd be logically identical to v1. I don't see why it would be worse or
better, other than that v2 provides much stronger forward compatibility
guarantees than v1.
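
To make the mix-in point concrete, here is a minimal sketch in Java. Every
name in it (Reader, SupportsLowerBound, query) is hypothetical and is not
taken from the proposed V2 API; the "engine" is a single static method and
the only pushdown modeled is a lower-bound filter on integers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical names throughout; this is not the proposed V2 API.
class MixInSketch {
    /** The only interface a source must implement: a plain full scan. */
    interface Reader {
        List<Integer> scan();
    }

    /** Optional mix-in: the source accepts a lower bound and applies it itself. */
    interface SupportsLowerBound {
        void pushLowerBound(int minInclusive);
    }

    /** Implements only Reader, so it behaves like a full-scan source. */
    static class PlainSource implements Reader {
        public List<Integer> scan() {
            return Arrays.asList(1, 5, 10, 20);
        }
    }

    /** Opts in to pushdown by also implementing the mix-in. */
    static class PushdownSource implements Reader, SupportsLowerBound {
        private int min = Integer.MIN_VALUE;

        public void pushLowerBound(int minInclusive) {
            this.min = minInclusive;
        }

        public List<Integer> scan() {
            List<Integer> out = new ArrayList<>();
            for (int v : Arrays.asList(1, 5, 10, 20)) {
                if (v >= min) {
                    out.add(v);
                }
            }
            return out;
        }
    }

    /** Stand-in for the engine: push the filter if supported, else filter after the scan. */
    static List<Integer> query(Reader source, int minInclusive) {
        boolean pushed = source instanceof SupportsLowerBound;
        if (pushed) {
            ((SupportsLowerBound) source).pushLowerBound(minInclusive);
        }
        List<Integer> out = new ArrayList<>();
        for (int v : source.scan()) {
            if (pushed || v >= minInclusive) {
                out.add(v);
            }
        }
        return out;
    }
}
```

Both sources return the same rows for the same query; the source without
the mix-in never sees a filter at all, which is the sense in which it is no
harder to write than a plain scan.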


On Tue, Aug 29, 2017 at 4:54 AM, James Baker <j.ba...@outlook.com> wrote:

> Copying from the code review comments I just submitted on the draft API (
> https://github.com/cloud-fan/spark/pull/10#pullrequestreview-59088745):
>
> Context here is that I've spent some time implementing a Spark datasource
> and have had some issues with the current API which are made worse in V2.
>
> The general conclusion I’ve come to here is that this is very hard to
> actually implement (in a similar but more aggressive way than DataSource
> V1, because of the extra methods and dimensions we get in V2).
>
> In DataSources V1 PrunedFilteredScan, the issue is that you are passed in
> the filters with the buildScan method, and then passed in again with the
> unhandledFilters method.
>
> However, the filters that you can’t handle might be data dependent, which
> the current API does not handle well. Suppose I can handle filter A some of
> the time, and filter B some of the time. If I’m passed in both, then either
> A and B are unhandled, or A, or B, or neither. The work I have to do to
> work this out is essentially the same as I have to do while actually
> generating my RDD (essentially I have to generate my partitions), so I end
> up doing some weird caching work.
>
> This V2 API proposal has the same issues, but perhaps more so. In
> PrunedFilteredScan, there is essentially one degree of freedom for pruning
> (filters), so you just have to implement caching between unhandledFilters
> and buildScan. However, here we have many degrees of freedom: sorts,
> individual filters, clustering, sampling, maybe aggregations eventually -
> and these operations are not all commutative, and computing my support
> one-by-one can easily end up being more expensive than computing all in one
> go.
>
> For some trivial examples:
>
> - After filtering, I might be sorted, whilst before filtering I might not
> be.
>
> - Filtering with certain filters might affect my ability to push down
> others.
>
> - Filtering with aggregations (as mooted) might not be possible to push
> down.
>
> And with the API as currently mooted, I need to be able to go back and
> change my results because they might change later.
>
> Really what would be good here is to pass all of the filters and sorts etc
> all at once, and then I return the parts I can’t handle.
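
A minimal sketch of this "all at once" shape, in Java. All names (Op,
Negotiation, negotiate) are hypothetical and belong to neither Spark nor the
V2 proposal. The toy source can filter on one indexed column, and its output
is sorted by that column only once such a filter has been applied, which is
the kind of interaction between operations described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; not part of Spark or of the V2 proposal.
class PushdownSketch {
    /** One requested operation; kind is "filter" or "sort" in this toy model. */
    static class Op {
        final String kind;
        final String column;

        Op(String kind, String column) {
            this.kind = kind;
            this.column = column;
        }

        @Override
        public String toString() {
            return kind + "(" + column + ")";
        }
    }

    /** The source's answer: what it will do itself and what the engine must still do. */
    static class Negotiation {
        final List<Op> handled = new ArrayList<>();
        final List<Op> unhandled = new ArrayList<>();
    }

    /**
     * Receives every request in one call, so the decision about the sort can
     * take the filters into account regardless of the order they arrive in.
     */
    static Negotiation negotiate(List<Op> requested, String indexedColumn) {
        // First pass: find out whether a filter on the indexed column is present.
        boolean filteredOnIndex = false;
        for (Op op : requested) {
            if (op.kind.equals("filter") && op.column.equals(indexedColumn)) {
                filteredOnIndex = true;
            }
        }
        // Second pass: split the requests into handled and unhandled.
        Negotiation result = new Negotiation();
        for (Op op : requested) {
            boolean onIndex = op.column.equals(indexedColumn);
            boolean canFilter = op.kind.equals("filter") && onIndex;
            boolean canSort = op.kind.equals("sort") && onIndex && filteredOnIndex;
            if (canFilter || canSort) {
                result.handled.add(op);
            } else {
                result.unhandled.add(op);
            }
        }
        return result;
    }
}
```

With a one-operation-at-a-time interface, a sort offered before the filter
would have to be refused and then revisited; here the answer is computed
once, with the whole request in view.
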
>
> I’d prefer in general that this be implemented by passing some kind of
> query plan to the datasource which enables this kind of replacement.
> Explicitly don’t want to give the whole query plan - that sounds painful -
> would prefer we push down only the parts of the query plan we deem to be
> stable. With the mix-in approach, I don’t think we can guarantee the
> properties we want without a two-phase thing - I’d really love to be able
> to just define a straightforward union type which is our supported pushdown
> stuff, and then the user can transform and return it.
>
> I think this ends up being a more elegant API for consumers, and also far
> more intuitive.
>
> James
>
> On Mon, 28 Aug 2017 at 18:00 蒋星博 <jiangxb1...@gmail.com> wrote:
>
>> +1 (Non-binding)
>>
>> On Mon, Aug 28, 2017 at 5:38 PM, Xiao Li <gatorsm...@gmail.com> wrote:
>>
>>> +1
>>>
>>> 2017-08-28 12:45 GMT-07:00 Cody Koeninger <c...@koeninger.org>:
>>>
>>>> Just wanted to point out that because the jira isn't labeled SPIP, it
>>>> won't have shown up linked from
>>>>
>>>> http://spark.apache.org/improvement-proposals.html
>>>>
>>>> On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan <cloud0...@gmail.com>
>>>> wrote:
>>>> > Hi all,
>>>> >
>>>> > It has been almost 2 weeks since I proposed the data source V2 for
>>>> > discussion, and we have already received some feedback on the JIRA
>>>> > ticket and the prototype PR, so I'd like to call for a vote.
>>>> >
>>>> > The full document of the Data Source API V2 is:
>>>> > https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>> >
>>>> > Note that this vote should focus on the high-level design/framework,
>>>> > not specific APIs, as we can always change/improve specific APIs
>>>> > during development.
>>>> >
>>>> > The vote will be up for the next 72 hours. Please reply with your
>>>> > vote:
>>>> >
>>>> > +1: Yeah, let's go forward and implement the SPIP.
>>>> > +0: Don't really care.
>>>> > -1: I don't think this is a good idea because of the following
>>>> > technical reasons.
>>>> >
>>>> > Thanks!
