I agree that I'd also like to see a design / goals document so clarify
the scope (and the non-goals, too)

In general, I would hesitate to add anything higher level to the
Gandiva protos -- there is already confusion from people who believe
that Gandiva is a "query engine" where it is actually a query engine
subsystem (execution kernel compiler/generator). See for example the
thread just a week ago [1]

If you add higher level query plan structures to the proto file, I
fear it will generate more confusion. If the plan ends up being to
have a larger proto file, it would be good to move it someplace that
isn't Gandiva-specific and clearly indicate that Gandiva is
responsible for code generation for certain structures in the proto.
We can also address some of these issues through better project
documentation and READMEs.

[1]: 
https://lists.apache.org/thread.html/212db05e98549f5938f3af41dade51d7a3e47255178a6c76652adc79@%3Cdev.arrow.apache.org%3E

On Sun, Jul 21, 2019 at 4:23 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> Some thoughts:
>
>    1. I think it would make sense to start with a design
>    discussion/document about the goals and what we think is implementation
>    specific versus generally applicable. In general, a distributed execution
>    plan seems pretty implementation specific. My sense is that you'd never run
>    a distributed execution plan outside of the knowledge of the particular
>    execution environment it is running within. Part of that is usually
>    distributed execution also includes lifecycle management. For example, if
>    you're going to have work-stealing  or early termination in your execution
>    engine, those are operations that stitch into execution coordination (and
>    thus a specific impl). If distributed execution is always engine specific,
>    why try to create a general one for multiple engines?
>    2. With regards to making Gandiva protos more generic: I'd like to see
>    more clarity on #1. On one hand, extending things so they are reused is
>    good. On the other hand, the more consumers of an interface, the more
>    overloads/non-impls you have for each consumer of it.
>
>
> On Sat, Jul 20, 2019 at 10:18 AM Andy Grove <andygrov...@gmail.com> wrote:
>
> > I recently created a small PoC of distributed query execution on Kubernetes
> > using the Rust implementation of Apache Arrow and the DataFusion query
> > engine [1].
> >
> > This PoC uses gRPC to pass query plans to executor nodes and the proto file
> > [2] is largely based on the Gandiva proto file [3]. The PoC is very basic
> > but I think it demonstrates the power of having query plans as part of the
> > proto file. This would allow distributed applications to be built based on
> > Arrow standards in a way that is not dependent on any particular
> > implementation of Arrow and would even allow mixing and matching query
> > engines.
> >
> > I wanted to start this discussion to see what the appetite is here for
> > accepting PRs to add query plan structures to the Gandiva proto file and
> > also whether we can consider making this an Arrow proto file rather than
> > being Gandiva-specific, over time.
> >
> > Thanks,
> >
> > Andy.
> >
> > [1] https://github.com/andygrove/ballista
> >
> > [2]
> >
> > https://github.com/andygrove/ballista/blob/master/proto/ballista/ballista.proto
> >
> > [3]
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
> >

Reply via email to