Hi everyone,

I have been thinking about view support over the weekend, and I realized there is a conflict: Trino today already claims to support Iceberg views through the Hive metastore.
I believe we need to figure out a path forward on this issue before voting on the current proposal, to avoid confusion for end users. I have summarized the issue, along with a few potential solutions, here: https://docs.google.com/document/d/1uupI7JJHEZIkHufo7sU4Enpwgg-ODCVBE6ocFUVD9oQ/edit?usp=sharing

Please let me know what you think.

Best,
Jack Ye

On Thu, Aug 26, 2021 at 3:29 PM Phillip Cloud <cpcl...@gmail.com> wrote:

> On Thu, Aug 26, 2021 at 6:07 PM Jacques Nadeau <jacquesnad...@gmail.com> wrote:
>
>> On Thu, Aug 26, 2021 at 2:44 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Would a physical plan be portable for the purpose of an engine-agnostic view?
>>
>> My goal is that it would be. There may be optional "hints" that a particular engine could leverage and others wouldn't, but I think the goal should be that the IR is entirely engine-agnostic. Even in the Arrow project proper, there are really two independent heavyweight engines that have their own capabilities and trajectories (C++ vs. Rust).
>>
>>> Physical plan details seem specific to an engine to me, but maybe I'm thinking too much about how Spark is implemented. My inclination would be to accept only logical IR, which could just mean accepting a subset of the standard.
>>
>> I think it is very likely that different consumers will only support a subset of plans. That being said, I'm not sure what you're specifically trying to mitigate or avoid. I'd be inclined to simply allow the full breadth of IR within Iceberg. If it is well specified, an engine can either choose to execute it or not (same as the proposal with respect to SQL syntax or a function that is missing on an engine). The engine may even have internal rewrites if it likes doing things a different way than what is requested.
>
> I also believe that consumers will not be expected to support all plans. It will depend on the consumer, but many of the instantiations of Read/Write relations won't be executable for many consumers, for example.
>
>>> The document that Micah linked to is interesting, but I'm not sure that our goals are aligned.
>>
>> I think there is much commonality here and I'd argue it would be best to really try to see if a unified set of goals works well. I think Arrow IR is young enough that it can still be shaped/adapted. It may be that there should be some give or take on each side. It's possible that the goals are too far apart to unify, but my gut is that they are close enough that we should try, since it would be a great force multiplier.
>>
>>> For one thing, it seems to make assumptions about the IR being used for Arrow data (at least in Wes' proposal), when I think that it may be easier to be agnostic to vectorization.
>>
>> Other than using the Arrow schema/types, I'm not at all convinced that the IR should be Arrow-centric. I've actually argued to some that Arrow IR should be independent of Arrow to be its best self. Let's try to review it and see if/where we can avoid a tight coupling between plans and Arrow-specific concepts.
>
> Just to echo Jacques's comments here, the only thing that is Arrow specific right now is the use of its type system. Literals, for example, are encoded entirely in flatbuffers.
>
> Would love feedback on the current PR [1]. I'm looking to merge the first iteration soonish, so please review at your earliest convenience.
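To make the "subset of plans" point above concrete, here is a minimal sketch (Python, with hypothetical relation names rather than the actual Arrow IR types) of a consumer that simply declines any plan containing relations it does not implement:

```python
# Minimal sketch, assuming a hypothetical tree-shaped plan; not the real
# Arrow IR schema. A consumer advertises the relation kinds it can execute
# and refuses a whole plan up front if any node falls outside that set,
# e.g. an engine that reads data but never executes Write relations.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Relation:
    kind: str                              # e.g. "Read", "Filter", "Project", "Write"
    inputs: List["Relation"] = field(default_factory=list)

SUPPORTED_KINDS = {"Read", "Filter", "Project", "Aggregate"}

def can_execute(plan: Relation) -> bool:
    """Walk the plan tree; reject anything containing an unsupported relation."""
    if plan.kind not in SUPPORTED_KINDS:
        return False
    return all(can_execute(child) for child in plan.inputs)

# A plan ending in a Write relation is rejected before execution starts,
# rather than failing halfway through.
plan = Relation("Write", [Relation("Project", [Relation("Read")])])
assert can_execute(plan) is False
```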
>>> It also delegates forward/backward compatibility to flatbuffers, when I think compatibility should be part of the semantics and not delegated to serialization. For example, if I have Join("inner", a.id, b.id) and I evolve that to allow additional predicates, Join("inner", a.id, b.id, a.x < b.y), then just because I can deserialize it doesn't mean it is compatible.
>>
>> I don't think that flatbuffers alone can solve all compatibility problems. It can solve some, and I'd expect that implementation libraries will have to solve others. Would love to hear if others disagree (and think flatbuffers can solve everything with respect to compatibility).
>
> I agree, I think you need both to achieve sane versioning. The version needs to be shipped along with the IR, and libraries need to be able to deal with the different versions. I could be wrong, but I think it probably makes more sense to start versioning the IR once the dust has settled a bit.
>
>> J
>
> [1]: https://github.com/apache/arrow/pull/10934
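To illustrate the point that being able to deserialize a plan is not the same as being compatible with it, here is a small sketch (hypothetical Python classes, not the real IR schema): an engine built against the older Join shape can still decode the newer message, but it has to detect and refuse the extra predicate rather than silently dropping it.

```python
# Minimal sketch, assuming hypothetical classes; not the actual Arrow IR.
# Serialization-level compatibility is not semantic compatibility: an old
# reader can decode a Join that gained an optional predicate field, but
# ignoring that predicate would change the query result.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Join:
    join_type: str                          # e.g. "inner"
    left_key: str                           # e.g. "a.id"
    right_key: str                          # e.g. "b.id"
    extra_predicate: Optional[str] = None   # added in a later IR revision, e.g. "a.x < b.y"

def execute_v1(join: Join) -> None:
    """A v1 engine that predates extra_predicate must refuse it, not ignore it."""
    if join.extra_predicate is not None:
        raise NotImplementedError("plan uses IR features this engine does not understand")
    # ... run the plain equi-join as before ...

old_plan = Join("inner", "a.id", "b.id")
new_plan = Join("inner", "a.id", "b.id", extra_predicate="a.x < b.y")
execute_v1(old_plan)      # fine
# execute_v1(new_plan)    # raises NotImplementedError, by design
```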