Thanks, Jacques and Wes. I agree that this needs discussion and a design document. I have put together this Google doc to get the ball rolling:
https://docs.google.com/document/d/1Uv1FmPs7uYMLoJUH1EF0oxm-ujtz1h1tJFl0zN60TIg/edit?usp=sharing

Thanks,

Andy.

On Mon, Jul 22, 2019 at 6:39 AM Wes McKinney <wesmck...@gmail.com> wrote:

> I agree that I'd also like to see a design / goals document to clarify
> the scope (and the non-goals, too)
>
> In general, I would hesitate to add anything higher level to the
> Gandiva protos -- there is already confusion from people who believe
> that Gandiva is a "query engine" where it is actually a query engine
> subsystem (execution kernel compiler/generator). See for example the
> thread just a week ago [1]
>
> If you add higher level query plan structures to the proto file, I
> fear it will generate more confusion. If the plan ends up being to
> have a larger proto file, it would be good to move it someplace that
> isn't Gandiva-specific and clearly indicate that Gandiva is
> responsible for code generation for certain structures in the proto.
> We can also address some of these issues through better project
> documentation and READMEs.
>
> [1]:
> https://lists.apache.org/thread.html/212db05e98549f5938f3af41dade51d7a3e47255178a6c76652adc79@%3Cdev.arrow.apache.org%3E
>
> On Sun, Jul 21, 2019 at 4:23 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >
> > Some thoughts:
> >
> > 1. I think it would make sense to start with a design
> > discussion/document about the goals and what we think is
> > implementation specific versus generally applicable. In general, a
> > distributed execution plan seems pretty implementation specific. My
> > sense is that you'd never run a distributed execution plan outside of
> > the knowledge of the particular execution environment it is running
> > within. Part of that is that distributed execution usually also
> > includes lifecycle management. For example, if you're going to have
> > work-stealing or early termination in your execution engine, those are
> > operations that stitch into execution coordination (and thus a
> > specific impl). If distributed execution is always engine specific,
> > why try to create a general one for multiple engines?
> > 2. With regards to making Gandiva protos more generic: I'd like to see
> > more clarity on #1. On one hand, extending things so they are reused
> > is good. On the other hand, the more consumers of an interface, the
> > more overloads/non-impls you have for each consumer of it.
> >
> >
> > On Sat, Jul 20, 2019 at 10:18 AM Andy Grove <andygrov...@gmail.com> wrote:
> >
> > > I recently created a small PoC of distributed query execution on
> > > Kubernetes using the Rust implementation of Apache Arrow and the
> > > DataFusion query engine [1].
> > >
> > > This PoC uses gRPC to pass query plans to executor nodes and the
> > > proto file [2] is largely based on the Gandiva proto file [3]. The
> > > PoC is very basic but I think it demonstrates the power of having
> > > query plans as part of the proto file. This would allow distributed
> > > applications to be built based on Arrow standards in a way that is
> > > not dependent on any particular implementation of Arrow and would
> > > even allow mixing and matching query engines.
> > >
> > > I wanted to start this discussion to see what the appetite is here
> > > for accepting PRs to add query plan structures to the Gandiva proto
> > > file and also whether we can consider making this an Arrow proto
> > > file rather than being Gandiva-specific, over time.
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> > > [1] https://github.com/andygrove/ballista
> > >
> > > [2]
> > > https://github.com/andygrove/ballista/blob/master/proto/ballista/ballista.proto
> > >
> > > [3]
> > > https://github.com/apache/arrow/blob/master/cpp/src/gandiva/proto/Types.proto
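
For readers following the thread, here is a rough sketch of the idea in the original message: a logical query plan expressed as plain data, serialized, shipped to an executor, and interpreted there without the sender knowing which engine runs it. Everything in it is an assumption made for illustration. The node names (`scan`, `filter`, `project`), the field names, and the use of JSON as a stand-in for protobuf over gRPC are hypothetical and are not taken from the Ballista or Gandiva proto files.

```python
import json
import operator

# Hypothetical comparison operators a plan may reference by name.
COMPARATORS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}


def scan(table):
    """Plan node: read every row of a named table."""
    return {"op": "scan", "table": table}


def filter_rows(child, column, cmp, value):
    """Plan node: keep rows where `column <cmp> value` holds."""
    if cmp not in COMPARATORS:
        raise ValueError(f"unknown comparator: {cmp}")
    return {"op": "filter", "input": child, "column": column,
            "cmp": cmp, "value": value}


def project(child, columns):
    """Plan node: keep only the listed columns."""
    return {"op": "project", "input": child, "columns": list(columns)}


def to_wire(plan):
    """Serialize a plan to bytes (JSON here; protobuf in the PoC)."""
    return json.dumps(plan).encode("utf-8")


def from_wire(payload):
    """Deserialize a plan received by an executor."""
    return json.loads(payload.decode("utf-8"))


def execute(plan, tables):
    """Toy executor: interpret a plan against in-memory tables.

    `tables` maps a table name to a list of row dicts.
    """
    op = plan["op"]
    if op == "scan":
        return list(tables[plan["table"]])
    if op == "filter":
        rows = execute(plan["input"], tables)
        cmp = COMPARATORS[plan["cmp"]]
        return [r for r in rows if cmp(r[plan["column"]], plan["value"])]
    if op == "project":
        rows = execute(plan["input"], tables)
        return [{c: r[c] for c in plan["columns"]} for r in rows]
    raise ValueError(f"unknown plan node: {op}")
```

A client would build a plan such as `project(filter_rows(scan("trips"), "fare", "gt", 10), ["id"])`, call `to_wire` on it, and any executor that understands the same node vocabulary could run the result of `from_wire`. The sketch deliberately covers only the plan description; the lifecycle and coordination concerns Jacques raises (work-stealing, early termination) are exactly the parts it leaves out.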