On Thu, Aug 26, 2021 at 2:44 PM Ryan Blue <b...@tabular.io> wrote:

> Would a physical plan be portable for the purpose of an engine-agnostic
> view?
>

My goal is it would be. There may be optional "hints" that a particular
engine could leverage and others wouldn't but I think the goal should be
that the IR is entirely engine-agnostic. Even in the Arrow project proper,
there are really two independent heavy-weight engines that have their own
capabilities and trajectories (c++ vs rust).


> Physical plan details seem specific to an engine to me, but maybe I'm
> thinking too much about how Spark is implemented. My inclination would be
> to accept only logical IR, which could just mean accepting a subset of the
> standard.
>

I think it is very likely that different consumers will only support a
subset of plans. That being said, I'm not sure what you're specifically
trying to mitigate or avoid. I'd be inclined to simply allow the full
breadth of IR within Iceberg. If it is well specified, an engine can either
choose to execute or not (same as the proposal wrt to SQL syntax or if a
function is missing on an engine). The engine may even have internal
rewrites if it likes doing things a different way than what is requested.


> The document that Micah linked to is interesting, but I'm not sure that
> our goals are aligned.
>

I think there is much commonality here and I'd argue it would be best to
really try to see if a unified set of goals works well. I think Arrow IR is
young enough that it can still be shaped/adapted. It may be that there
should be some give or take on each side. It's possible that the goals are
too far apart to unify but my gut is that they are close enough that we
should try since it would be a great force multiplier.


> For one thing, it seems to make assumptions about the IR being used for
> Arrow data (at least in Wes' proposal), when I think that it may be easier
> to be agnostic to vectorization.
>

Other than using the Arrow schema/types, I'm not at all convinced that the
IR should be Arrow centric. I've actually argued to some that Arrow IR
should be independent of Arrow to be its best self. Let's try to review it
and see if/where we can avoid a tight coupling between plans and arrow
specific concepts.


> It also delegates forward/backward compatibility to flatbuffers, when I
> think compatibility should be part of the semantics and not delegated to
> serialization. For example, if I have Join("inner", a.id, b.id) and I
> evolve that to allow additional predicates Join("inner", a.id, b.id, a.x
> < b.y) then just because I can deserialize it doesn't mean it is compatible.
>

I don't think that flatbuffers alone can solve all compatibility problems.
It can solve some and I'd expect that implementation libraries will have to
solve others. Would love to hear if others disagree (and think flatbuffers
can solve everything wrt compatibility).

J

>

Reply via email to