Hi all,

Please see the spec in markdown format at the PR here
<https://github.com/apache/iceberg/pull/3188> to facilitate adding/responding
to comments. Please review.
thanks,
Anjali

On Tue, Sep 7, 2021 at 9:31 PM Jack Ye <yezhao...@gmail.com> wrote:

> Hi everyone,
>
> I have been thinking about the view support over the weekend, and I
> realized there is a conflict: Trino today already claims to support
> Iceberg views through the Hive metastore.
>
> I believe we need to figure out a path forward on this issue before
> voting to pass the current proposal, to avoid confusion for end users. I
> have summarized the issue, along with a few different potential solutions,
> here:
>
> https://docs.google.com/document/d/1uupI7JJHEZIkHufo7sU4Enpwgg-ODCVBE6ocFUVD9oQ/edit?usp=sharing
>
> Please let me know what you think.
>
> Best,
> Jack Ye
>
> On Thu, Aug 26, 2021 at 3:29 PM Phillip Cloud <cpcl...@gmail.com> wrote:
>
>> On Thu, Aug 26, 2021 at 6:07 PM Jacques Nadeau <jacquesnad...@gmail.com>
>> wrote:
>>
>>> On Thu, Aug 26, 2021 at 2:44 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Would a physical plan be portable for the purpose of an engine-agnostic
>>>> view?
>>>
>>> My goal is that it would be. There may be optional "hints" that a
>>> particular engine could leverage and others wouldn't, but I think the
>>> goal should be that the IR is entirely engine-agnostic. Even in the
>>> Arrow project proper, there are really two independent heavyweight
>>> engines that have their own capabilities and trajectories (C++ vs. Rust).
>>>
>>>> Physical plan details seem specific to an engine to me, but maybe I'm
>>>> thinking too much about how Spark is implemented. My inclination would
>>>> be to accept only logical IR, which could just mean accepting a subset
>>>> of the standard.
>>>
>>> I think it is very likely that different consumers will only support a
>>> subset of plans. That being said, I'm not sure what you're specifically
>>> trying to mitigate or avoid. I'd be inclined to simply allow the full
>>> breadth of IR within Iceberg.
>>> If it is well specified, an engine can either choose to execute it or
>>> not (the same as the proposal wrt SQL syntax, or when a function is
>>> missing on an engine). The engine may even have internal rewrites if it
>>> likes doing things a different way than what is requested.
>>
>> I also believe that consumers will not be expected to support all plans.
>> It will depend on the consumer, but many of the instantiations of
>> Read/Write relations won't be executable for many consumers, for example.
>>
>>>> The document that Micah linked to is interesting, but I'm not sure that
>>>> our goals are aligned.
>>>
>>> I think there is much commonality here, and I'd argue it would be best
>>> to really try to see if a unified set of goals works well. I think Arrow
>>> IR is young enough that it can still be shaped/adapted. It may be that
>>> there should be some give or take on each side. It's possible that the
>>> goals are too far apart to unify, but my gut is that they are close
>>> enough that we should try, since it would be a great force multiplier.
>>>
>>>> For one thing, it seems to make assumptions about the IR being used for
>>>> Arrow data (at least in Wes' proposal), when I think that it may be
>>>> easier to be agnostic to vectorization.
>>>
>>> Other than using the Arrow schema/types, I'm not at all convinced that
>>> the IR should be Arrow-centric. I've actually argued to some that Arrow
>>> IR should be independent of Arrow to be its best self. Let's try to
>>> review it and see if/where we can avoid a tight coupling between plans
>>> and Arrow-specific concepts.
>>
>> Just to echo Jacques's comments here, the only thing that is
>> Arrow-specific right now is the use of its type system. Literals, for
>> example, are encoded entirely in flatbuffers.
>>
>> Would love feedback on the current PR [1]. I'm looking to merge the
>> first iteration soonish, so please review at your earliest convenience.
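[Editorial note: the "each consumer supports a subset of plans" idea above can be sketched in a few lines. This is a hypothetical plan shape and hypothetical operator names for illustration only, not the actual Arrow IR or Iceberg format: a consumer walks the plan tree and either accepts it wholesale or cleanly declines, just as an engine would decline unknown SQL syntax or a missing function.]

```python
# Hypothetical plan representation: a plan node is a tuple of
# (operator_name, list_of_child_plans). A consumer advertises the set of
# operators it implements and declines whole plans containing anything
# else, rather than failing partway through execution.
def supported(plan, implemented_ops):
    op, children = plan
    return op in implemented_ops and all(
        supported(child, implemented_ops) for child in children
    )

# Example operator vocabulary for one engine (illustrative names).
ENGINE_OPS = {"read", "filter", "project", "inner_join"}

ok_plan = ("project", [("filter", [("read", [])])])
bad_plan = ("project", [("asof_join", [("read", []), ("read", [])])])

assert supported(ok_plan, ENGINE_OPS)       # engine can run this plan
assert not supported(bad_plan, ENGINE_OPS)  # engine declines, as with unknown SQL
```

Another engine with a richer operator set would accept `bad_plan` unchanged; the IR itself stays engine-agnostic.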
>>>> It also delegates forward/backward compatibility to flatbuffers, when
>>>> I think compatibility should be part of the semantics and not delegated
>>>> to serialization. For example, if I have Join("inner", a.id, b.id) and
>>>> I evolve that to allow additional predicates, Join("inner", a.id, b.id,
>>>> a.x < b.y), then just because I can deserialize it doesn't mean it is
>>>> compatible.
>>>
>>> I don't think that flatbuffers alone can solve all compatibility
>>> problems. It can solve some, and I'd expect that implementation
>>> libraries will have to solve others. Would love to hear if others
>>> disagree (and think flatbuffers can solve everything wrt compatibility).
>>
>> I agree; I think you need both to achieve sane versioning. The version
>> needs to be shipped along with the IR, and libraries need to be able to
>> deal with the different versions. I could be wrong, but I think it
>> probably makes more sense to start versioning the IR once the dust has
>> settled a bit.
>>
>>> J
>>
>> [1]: https://github.com/apache/arrow/pull/10934
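[Editorial note: Ryan's Join-evolution example above can be made concrete with a small sketch. The node class and field names below are hypothetical, not the actual Arrow IR schema. The point it shows: a schema-evolution-friendly format like flatbuffers lets an old reader deserialize a new Join node by skipping the unknown predicate field, but silently dropping that predicate would change query results, so compatibility has to be checked semantically, not assumed from successful deserialization.]

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical IR node. Version 2 of the format added an optional
# extra_predicate field; an old reader can still deserialize v2 nodes
# because the serialization layer tolerates unknown/extra fields.
@dataclass
class Join:
    kind: str                               # e.g. "inner"
    left_key: str                           # e.g. "a.id"
    right_key: str                          # e.g. "b.id"
    extra_predicate: Optional[str] = None   # new in v2, e.g. "a.x < b.y"

def v1_can_execute(node: Join) -> bool:
    """A v1 consumer must refuse nodes carrying semantics it does not
    understand, rather than silently ignoring the extra predicate."""
    return node.extra_predicate is None

old_node = Join("inner", "a.id", "b.id")
new_node = Join("inner", "a.id", "b.id", "a.x < b.y")

assert v1_can_execute(old_node)       # deserializable AND semantically OK
assert not v1_can_execute(new_node)   # deserializable, but NOT compatible
```

This is one way to reconcile the two positions in the thread: the serialization layer handles wire-level evolution, while the library carries a semantic check (plus a shipped version number) that decides whether a plan is actually safe to execute.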