Thanks, Jacques and Wes.

I agree that this needs discussion and a design document. I have put
together this Google doc to get the ball rolling:

https://docs.google.com/document/d/1Uv1FmPs7uYMLoJUH1EF0oxm-ujtz1h1tJFl0zN60TIg/edit?usp=sharing

Thanks,

Andy.

On Mon, Jul 22, 2019 at 6:39 AM Wes McKinney <wesmck...@gmail.com> wrote:

> I agree that I'd also like to see a design / goals document to clarify
> the scope (and the non-goals, too).
>
> In general, I would hesitate to add anything higher level to the
> Gandiva protos -- there is already confusion from people who believe
> that Gandiva is a "query engine" when it is actually a query engine
> subsystem (execution kernel compiler/generator). See for example the
> thread just a week ago [1].
>
> If you add higher level query plan structures to the proto file, I
> fear it will generate more confusion. If the plan ends up being to
> have a larger proto file, it would be good to move it someplace that
> isn't Gandiva-specific and clearly indicate that Gandiva is
> responsible for code generation for certain structures in the proto.
> We can also address some of these issues through better project
> documentation and READMEs.
>
> [1]:
> https://lists.apache.org/thread.html/212db05e98549f5938f3af41dade51d7a3e47255178a6c76652adc79@%3Cdev.arrow.apache.org%3E
>
> On Sun, Jul 21, 2019 at 4:23 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >
> > Some thoughts:
> >
> >    1. I think it would make sense to start with a design
> >    discussion/document about the goals and what we think is
> >    implementation specific versus generally applicable. In general, a
> >    distributed execution plan seems pretty implementation specific. My
> >    sense is that you'd never run a distributed execution plan outside of
> >    the knowledge of the particular execution environment it is running
> >    within. Part of that is that distributed execution usually also
> >    includes lifecycle management. For example, if you're going to have
> >    work-stealing or early termination in your execution engine, those are
> >    operations that stitch into execution coordination (and thus a
> >    specific impl). If distributed execution is always engine specific,
> >    why try to create a general one for multiple engines?
> >    2. With regard to making Gandiva protos more generic: I'd like to see
> >    more clarity on #1. On one hand, extending things so they are reused
> >    is good. On the other hand, the more consumers of an interface, the
> >    more overloads/non-impls you have for each consumer of it.
> >
> >
> > On Sat, Jul 20, 2019 at 10:18 AM Andy Grove <andygrov...@gmail.com> wrote:
> >
> > > I recently created a small PoC of distributed query execution on
> > > Kubernetes using the Rust implementation of Apache Arrow and the
> > > DataFusion query engine [1].
> > >
> > > This PoC uses gRPC to pass query plans to executor nodes and the proto
> > > file [2] is largely based on the Gandiva proto file [3]. The PoC is very
> > > basic but I think it demonstrates the power of having query plans as
> > > part of the proto file. This would allow distributed applications to be
> > > built based on Arrow standards in a way that is not dependent on any
> > > particular implementation of Arrow and would even allow mixing and
> > > matching query engines.
> > >
> > > I wanted to start this discussion to see what the appetite is here for
> > > accepting PRs to add query plan structures to the Gandiva proto file and
> > > also whether we can consider making this an Arrow proto file rather than
> > > being Gandiva-specific, over time.
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > > [1] https://github.com/andygrove/ballista
> > >
> > > [2] https://github.com/andygrove/ballista/blob/master/proto/ballista/ballista.proto
> > >
> > > [3] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
> > >
>
