Re: data source v2 online meetup

Ryan Blue Thu, 01 Feb 2018 11:16:12 -0800

We don't mind updating Iceberg when the API improves. We are fully aware
that this is a very early implementation and will change. My hope is that
the community is receptive to our suggestions.

A good example of an area with friction is filter and projection push-down.
The implementation for DSv2 isn't based on what the other read paths do; it
is brand new and mostly untested. I don't really understand why DSv2
introduced a new code path, when reusing existing code for this ended up
being smaller and working for more cases (see my comments on #20476
<https://github.com/apache/spark/pull/20476>). I understand wanting to fix
parts of push-down, just not why it is a good idea to mix that substantial
change into an unrelated API update. This is one area where, I hope, our
suggestion to get DSv2 working well and redesign push-down as a parallel
effort is heard.

I also see a few areas where the integration of DSv2 conflicts with what I
understand to be design principles of the catalyst optimizer. That DSv2
should use immutable nodes in plans is mostly settled, but there are
other examples. The approach of the new push-down implementation fights
against the principle of small rules that don't need to process the entire
plan tree. I think this makes the component brittle, and I'd like to
understand the rationale for going with this design. I'd love to see a
design document that covers why this is a necessary choice (but again,
separately).

rb

On Thu, Feb 1, 2018 at 9:10 AM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> +1 hangout
>
> ------------------------------
> *From:* Xiao Li <gatorsm...@gmail.com>
> *Sent:* Wednesday, January 31, 2018 10:46:26 PM
> *To:* Ryan Blue
> *Cc:* Reynold Xin; dev; Wenchen Fen; Russell Spitzer
> *Subject:* Re: data source v2 online meetup
>
> Hi, Ryan,
>
> Wow, your Iceberg already uses the data source V2 API! That is pretty cool! I
> am just afraid these new APIs are not stable. We might deprecate or change
> some data source v2 APIs in the next version (2.4). Sorry for the
> inconvenience it might introduce.
>
> Thanks for your feedback always,
>
> Xiao
>
>
> 2018-01-31 15:54 GMT-08:00 Ryan Blue <rb...@netflix.com.invalid>:
>
>> Thanks for suggesting this, I think it's a great idea. I'll definitely
>> attend and can talk about the changes that we've made to DataSourceV2 to
>> enable our new table format, Iceberg
>> <https://github.com/Netflix/iceberg#about-iceberg>.
>>
>> On Wed, Jan 31, 2018 at 2:35 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> The data source v2 API is one of the larger changes in Spark 2.3. What has
>>> already been committed is only the first version, and we'd need more work
>>> post-2.3 to improve and stabilize it.
>>>
>>> I think at this point we should stop making changes to it in branch-2.3,
>>> and instead focus on using the existing API and getting feedback for 2.4.
>>> Would people be interested in doing an online hangout to discuss this,
>>> perhaps in the month of Feb?
>>>
>>> It'd be more productive if people attending the hangout have tried the
>>> API by implementing some new sources or porting an existing source over.
>>>
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>

-- 
Ryan Blue
Software Engineer
Netflix
