Hi everyone,

I have been thinking about view support over the weekend, and I realized there is a conflict: Trino today already claims to support Iceberg views through the Hive metastore.
I believe we need to figure out a path forward on this issue before voting on the current proposal, to avoid confusion for end users. I have summarized the issue, along with a few potential solutions, here: https://docs.google.com/document/d/1uupI7JJHEZIkHufo7sU4Enpwgg-ODCVBE6ocFUVD9oQ/edit?usp=sharing

Please let me know what you think.

Best,
Jack Ye

On Thu, Aug 26, 2021 at 3:29 PM Phillip Cloud <cpcl...@gmail.com> wrote:

> On Thu, Aug 26, 2021 at 6:07 PM Jacques Nadeau <jacquesnad...@gmail.com> wrote:
>
>> On Thu, Aug 26, 2021 at 2:44 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Would a physical plan be portable for the purpose of an engine-agnostic view?
>>
>> My goal is that it would be. There may be optional "hints" that a particular engine could leverage and others wouldn't, but I think the goal should be that the IR is entirely engine-agnostic. Even in the Arrow project proper, there are really two independent heavyweight engines that have their own capabilities and trajectories (C++ vs. Rust).
>>
>>> Physical plan details seem specific to an engine to me, but maybe I'm thinking too much about how Spark is implemented. My inclination would be to accept only logical IR, which could just mean accepting a subset of the standard.
>>
>> I think it is very likely that different consumers will only support a subset of plans. That being said, I'm not sure what you're specifically trying to mitigate or avoid. I'd be inclined to simply allow the full breadth of IR within Iceberg. If it is well specified, an engine can either choose to execute it or not (same as the proposal with respect to SQL syntax or a function that is missing on an engine). The engine may even have internal rewrites if it likes doing things a different way than what is requested.
>
> I also believe that consumers will not be expected to support all plans. It will depend on the consumer, but many of the instantiations of Read/Write relations won't be executable for many consumers, for example.
>
>>> The document that Micah linked to is interesting, but I'm not sure that our goals are aligned.
>>
>> I think there is much commonality here and I'd argue it would be best to really try to see if a unified set of goals works well. I think Arrow IR is young enough that it can still be shaped/adapted. It may be that there should be some give or take on each side. It's possible that the goals are too far apart to unify, but my gut is that they are close enough that we should try, since it would be a great force multiplier.
>>
>>> For one thing, it seems to make assumptions about the IR being used for Arrow data (at least in Wes' proposal), when I think that it may be easier to be agnostic to vectorization.
>>
>> Other than using the Arrow schema/types, I'm not at all convinced that the IR should be Arrow-centric. I've actually argued to some that Arrow IR should be independent of Arrow to be its best self. Let's try to review it and see if/where we can avoid a tight coupling between plans and Arrow-specific concepts.
>
> Just to echo Jacques's comments here, the only thing that is Arrow specific right now is the use of its type system. Literals, for example, are encoded entirely in flatbuffers.
>
> Would love feedback on the current PR [1]. I'm looking to merge the first iteration soonish, so please review at your earliest convenience.
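To make the "subset of plans" point above concrete, here is a minimal sketch (Python, with hypothetical relation names rather than the actual Arrow IR types) of a consumer that simply declines any plan containing relations it does not implement:

```python
# Minimal sketch, assuming a hypothetical tree-shaped plan; not the real
# Arrow IR schema. A consumer advertises the relation kinds it can execute
# and refuses a whole plan up front if any node falls outside that set,
# e.g. an engine that reads data but never executes Write relations.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Relation:
    kind: str                              # e.g. "Read", "Filter", "Project", "Write"
    inputs: List["Relation"] = field(default_factory=list)

SUPPORTED_KINDS = {"Read", "Filter", "Project", "Aggregate"}

def can_execute(plan: Relation) -> bool:
    """Walk the plan tree; reject anything containing an unsupported relation."""
    if plan.kind not in SUPPORTED_KINDS:
        return False
    return all(can_execute(child) for child in plan.inputs)

# A plan ending in a Write relation is rejected before execution starts,
# rather than failing halfway through.
plan = Relation("Write", [Relation("Project", [Relation("Read")])])
assert can_execute(plan) is False
```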
>>> It also delegates forward/backward compatibility to flatbuffers, when I think compatibility should be part of the semantics and not delegated to serialization. For example, if I have Join("inner", a.id, b.id) and I evolve that to allow additional predicates, Join("inner", a.id, b.id, a.x < b.y), then just because I can deserialize it doesn't mean it is compatible.
>>
>> I don't think that flatbuffers alone can solve all compatibility problems. It can solve some, and I'd expect that implementation libraries will have to solve others. Would love to hear if others disagree (and think flatbuffers can solve everything with respect to compatibility).
>
> I agree, I think you need both to achieve sane versioning. The version needs to be shipped along with the IR, and libraries need to be able to deal with the different versions. I could be wrong, but I think it probably makes more sense to start versioning the IR once the dust has settled a bit.
>
>> J
>
> [1]: https://github.com/apache/arrow/pull/10934
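To illustrate the point that being able to deserialize a plan is not the same as being compatible with it, here is a small sketch (hypothetical Python classes, not the real IR schema): an engine built against the older Join shape can still decode the newer message, but it has to detect and refuse the extra predicate rather than silently dropping it.

```python
# Minimal sketch, assuming hypothetical classes; not the actual Arrow IR.
# Serialization-level compatibility is not semantic compatibility: an old
# reader can decode a Join that gained an optional predicate field, but
# ignoring that predicate would change the query result.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Join:
    join_type: str                          # e.g. "inner"
    left_key: str                           # e.g. "a.id"
    right_key: str                          # e.g. "b.id"
    extra_predicate: Optional[str] = None   # added in a later IR revision, e.g. "a.x < b.y"

def execute_v1(join: Join) -> None:
    """A v1 engine that predates extra_predicate must refuse it, not ignore it."""
    if join.extra_predicate is not None:
        raise NotImplementedError("plan uses IR features this engine does not understand")
    # ... run the plain equi-join as before ...

old_plan = Join("inner", "a.id", "b.id")
new_plan = Join("inner", "a.id", "b.id", extra_predicate="a.x < b.y")
execute_v1(old_plan)      # fine
# execute_v1(new_plan)    # raises NotImplementedError, by design
```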