On Thu, Aug 26, 2021 at 2:44 PM Ryan Blue <b...@tabular.io> wrote: > Would a physical plan be portable for the purpose of an engine-agnostic > view? >
My goal is it would be. There may be optional "hints" that a particular engine could leverage and others wouldn't but I think the goal should be that the IR is entirely engine-agnostic. Even in the Arrow project proper, there are really two independent heavy-weight engines that have their own capabilities and trajectories (c++ vs rust). > Physical plan details seem specific to an engine to me, but maybe I'm > thinking too much about how Spark is implemented. My inclination would be > to accept only logical IR, which could just mean accepting a subset of the > standard. > I think it is very likely that different consumers will only support a subset of plans. That being said, I'm not sure what you're specifically trying to mitigate or avoid. I'd be inclined to simply allow the full breadth of IR within Iceberg. If it is well specified, an engine can either choose to execute or not (same as the proposal wrt to SQL syntax or if a function is missing on an engine). The engine may even have internal rewrites if it likes doing things a different way than what is requested. > The document that Micah linked to is interesting, but I'm not sure that > our goals are aligned. > I think there is much commonality here and I'd argue it would be best to really try to see if a unified set of goals works well. I think Arrow IR is young enough that it can still be shaped/adapted. It may be that there should be some give or take on each side. It's possible that the goals are too far apart to unify but my gut is that they are close enough that we should try since it would be a great force multiplier. > For one thing, it seems to make assumptions about the IR being used for > Arrow data (at least in Wes' proposal), when I think that it may be easier > to be agnostic to vectorization. > Other than using the Arrow schema/types, I'm not at all convinced that the IR should be Arrow centric. I've actually argued to some that Arrow IR should be independent of Arrow to be its best self. Let's try to review it and see if/where we can avoid a tight coupling between plans and arrow specific concepts. > It also delegates forward/backward compatibility to flatbuffers, when I > think compatibility should be part of the semantics and not delegated to > serialization. For example, if I have Join("inner", a.id, b.id) and I > evolve that to allow additional predicates Join("inner", a.id, b.id, a.x > < b.y) then just because I can deserialize it doesn't mean it is compatible. > I don't think that flatbuffers alone can solve all compatibility problems. It can solve some and I'd expect that implementation libraries will have to solve others. Would love to hear if others disagree (and think flatbuffers can solve everything wrt compatibility). J >