Hi all,

Please see the spec in markdown format at the PR here
<https://github.com/apache/iceberg/pull/3188> to facilitate adding/responding
to comments. Please review.
thanks,
Anjali

On Tue, Sep 7, 2021 at 9:31 PM Jack Ye <yezhao...@gmail.com> wrote:

> Hi everyone,
>
> I have been thinking about the view support over the weekend, and I
> realized there is a conflict: Trino today already claims to support
> Iceberg views through the Hive metastore.
>
> I believe we need to figure out a path forward on this issue before
> voting to pass the current proposal, to avoid confusion for end users. I
> have summarized the issue, along with a few different potential solutions,
> here:
>
> https://docs.google.com/document/d/1uupI7JJHEZIkHufo7sU4Enpwgg-ODCVBE6ocFUVD9oQ/edit?usp=sharing
>
> Please let me know what you think.
>
> Best,
> Jack Ye
>
> On Thu, Aug 26, 2021 at 3:29 PM Phillip Cloud <cpcl...@gmail.com> wrote:
>
>> On Thu, Aug 26, 2021 at 6:07 PM Jacques Nadeau <jacquesnad...@gmail.com>
>> wrote:
>>
>>> On Thu, Aug 26, 2021 at 2:44 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Would a physical plan be portable for the purpose of an engine-agnostic
>>>> view?
>>>
>>> My goal is that it would be. There may be optional "hints" that a
>>> particular engine could leverage and others wouldn't, but I think the
>>> goal should be that the IR is entirely engine-agnostic. Even in the
>>> Arrow project proper, there are really two independent heavyweight
>>> engines that have their own capabilities and trajectories (C++ vs. Rust).
>>>
>>>> Physical plan details seem specific to an engine to me, but maybe I'm
>>>> thinking too much about how Spark is implemented. My inclination would
>>>> be to accept only logical IR, which could just mean accepting a subset
>>>> of the standard.
>>>
>>> I think it is very likely that different consumers will only support a
>>> subset of plans. That being said, I'm not sure what you're specifically
>>> trying to mitigate or avoid. I'd be inclined to simply allow the full
>>> breadth of IR within Iceberg.
>>> If it is well specified, an engine can either choose to execute it or
>>> not (the same as the proposal wrt SQL syntax, or when a function is
>>> missing on an engine). The engine may even have internal rewrites if it
>>> likes doing things a different way than what is requested.
>>
>> I also believe that consumers will not be expected to support all plans.
>> It will depend on the consumer, but many of the instantiations of
>> Read/Write relations won't be executable for many consumers, for example.
>>
>>>> The document that Micah linked to is interesting, but I'm not sure that
>>>> our goals are aligned.
>>>
>>> I think there is much commonality here, and I'd argue it would be best
>>> to really try to see if a unified set of goals works well. I think Arrow
>>> IR is young enough that it can still be shaped/adapted. It may be that
>>> there should be some give or take on each side. It's possible that the
>>> goals are too far apart to unify, but my gut is that they are close
>>> enough that we should try, since it would be a great force multiplier.
>>>
>>>> For one thing, it seems to make assumptions about the IR being used for
>>>> Arrow data (at least in Wes' proposal), when I think that it may be
>>>> easier to be agnostic to vectorization.
>>>
>>> Other than using the Arrow schema/types, I'm not at all convinced that
>>> the IR should be Arrow-centric. I've actually argued to some that Arrow
>>> IR should be independent of Arrow to be its best self. Let's try to
>>> review it and see if/where we can avoid a tight coupling between plans
>>> and Arrow-specific concepts.
>>
>> Just to echo Jacques's comments here, the only thing that is
>> Arrow-specific right now is the use of its type system. Literals, for
>> example, are encoded entirely in flatbuffers.
>>
>> Would love feedback on the current PR [1]. I'm looking to merge the
>> first iteration soonish, so please review at your earliest convenience.
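[Editorial note: the "each consumer supports a subset of plans" idea above can be sketched in a few lines. This is a hypothetical plan shape and hypothetical operator names for illustration only, not the actual Arrow IR or Iceberg format: a consumer walks the plan tree and either accepts it wholesale or cleanly declines, just as an engine would decline unknown SQL syntax or a missing function.]

```python
# Hypothetical plan representation: a plan node is a tuple of
# (operator_name, list_of_child_plans). A consumer advertises the set of
# operators it implements and declines whole plans containing anything
# else, rather than failing partway through execution.
def supported(plan, implemented_ops):
    op, children = plan
    return op in implemented_ops and all(
        supported(child, implemented_ops) for child in children
    )

# Example operator vocabulary for one engine (illustrative names).
ENGINE_OPS = {"read", "filter", "project", "inner_join"}

ok_plan = ("project", [("filter", [("read", [])])])
bad_plan = ("project", [("asof_join", [("read", []), ("read", [])])])

assert supported(ok_plan, ENGINE_OPS)       # engine can run this plan
assert not supported(bad_plan, ENGINE_OPS)  # engine declines, as with unknown SQL
```

Another engine with a richer operator set would accept `bad_plan` unchanged; the IR itself stays engine-agnostic.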
>>>> It also delegates forward/backward compatibility to flatbuffers, when
>>>> I think compatibility should be part of the semantics and not delegated
>>>> to serialization. For example, if I have Join("inner", a.id, b.id) and
>>>> I evolve that to allow additional predicates, Join("inner", a.id, b.id,
>>>> a.x < b.y), then just because I can deserialize it doesn't mean it is
>>>> compatible.
>>>
>>> I don't think that flatbuffers alone can solve all compatibility
>>> problems. It can solve some, and I'd expect that implementation
>>> libraries will have to solve others. Would love to hear if others
>>> disagree (and think flatbuffers can solve everything wrt compatibility).
>>
>> I agree; I think you need both to achieve sane versioning. The version
>> needs to be shipped along with the IR, and libraries need to be able to
>> deal with the different versions. I could be wrong, but I think it
>> probably makes more sense to start versioning the IR once the dust has
>> settled a bit.
>>
>>> J
>>
>> [1]: https://github.com/apache/arrow/pull/10934
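[Editorial note: Ryan's Join-evolution example above can be made concrete with a small sketch. The node class and field names below are hypothetical, not the actual Arrow IR schema. The point it shows: a schema-evolution-friendly format like flatbuffers lets an old reader deserialize a new Join node by skipping the unknown predicate field, but silently dropping that predicate would change query results, so compatibility has to be checked semantically, not assumed from successful deserialization.]

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical IR node. Version 2 of the format added an optional
# extra_predicate field; an old reader can still deserialize v2 nodes
# because the serialization layer tolerates unknown/extra fields.
@dataclass
class Join:
    kind: str                               # e.g. "inner"
    left_key: str                           # e.g. "a.id"
    right_key: str                          # e.g. "b.id"
    extra_predicate: Optional[str] = None   # new in v2, e.g. "a.x < b.y"

def v1_can_execute(node: Join) -> bool:
    """A v1 consumer must refuse nodes carrying semantics it does not
    understand, rather than silently ignoring the extra predicate."""
    return node.extra_predicate is None

old_node = Join("inner", "a.id", "b.id")
new_node = Join("inner", "a.id", "b.id", "a.x < b.y")

assert v1_can_execute(old_node)       # deserializable AND semantically OK
assert not v1_can_execute(new_node)   # deserializable, but NOT compatible
```

This is one way to reconcile the two positions in the thread: the serialization layer handles wire-level evolution, while the library carries a semantic check (plus a shipped version number) that decides whether a plan is actually safe to execute.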