Cool -- thanks Jorge and Andy. I'll start working on an actual proposed PR to merge. I really like Jorge's idea of trying to port the Aggregate operator to use a LogicalPlan Node style and see how easy that would be.
Andrew

On Sat, Aug 22, 2020 at 12:01 PM Andy Grove <andygrov...@gmail.com> wrote:

> I've added some comments as well. I fully support converting LogicalPlan
> from enum to trait. You will notice that I implemented PhysicalPlan using
> traits, because I had more Rust experience at the time I did this work.
>
> There are going to be some design challenges to make this change, I'm sure,
> but I think we can work through them.
>
> Thanks,
>
> Andy.
>
> On Fri, Aug 21, 2020 at 3:37 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> > I would like to propose and request comments from the DataFusion
> > community on adding user defined LogicalPlanNodes.
> >
> > A detailed proposal is in this google doc
> > <https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#>
> > (comments enabled, copy/pasted below for your convenience).
> >
> > Here is a PR showing how the approach could work:
> > https://github.com/apache/arrow/pull/8020
> > <https://github.com/apache/arrow/pull/8020#pullrequestreview-472829313>
> >
> > Thanks!
> > Andrew
> >
> >
> > Proposal for User Defined PlanNode / Operator API
> >
> > August 21, 2020
> >
> > Andrew Lamb
> >
> > This document is a high level proposal / request for comments from the
> > DataFusion community on adding user defined LogicalPlanNodes.
> >
> > Problem Statement / Rationale
> >
> > We are contemplating building a new time series related query engine
> > using DataFusion. To do so, we will likely need domain specific
> > optimizations which are unlikely to be appropriate for a general purpose
> > engine such as DataFusion because of their specialization.
> >
> > Examples of the kinds of optimizations we are thinking of:
> >
> > 1. Push down (certain) filter predicates and aggregates into the actual
> >    scan over specialized storage structures.
> >
> > 2.
> >    Specialized time series specific aggregates that rely on the order
> >    of the input rows, such as first/last.
> >
> > Proposed Solution
> >
> > I propose changing LogicalPlan
> > <https://github.com/apache/arrow/blob/master/rust/datafusion/src/logicalplan.rs#L773>
> > nodes from an enum of structs, which must be defined in the DataFusion
> > source code, to a tree of `dyn LogicalPlanNode` trait objects. Here is a
> > PR that demonstrates how such an approach could work:
> > https://github.com/apache/arrow/pull/8020
> >
> > The primary benefit of such a design over the existing enum of structs
> > is that users of the DataFusion library can write their own user defined
> > LogicalPlan nodes and still take advantage of general purpose logic such
> > as predicate push down.
> >
> > A downside of this design is that it will isolate the logic for the
> > treatment of each type of LogicalPlanNode into its own module for that
> > plan node. This means that algorithms over LogicalPlan nodes (e.g.
> > predicate pushdown) will no longer have any node type specific logic in
> > them, which could make them harder to reason about.
> >
> > Prior Work:
> >
> > I am not familiar enough with the Spark codebase to really understand
> > how/if Spark allows this, but I think the high level idea is that the
> > catalyst
> > <https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst>
> > library provides a similar interface called LogicalPlan
> > <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L29>
> > that defines the operators that are available in Spark SQL.
> >
> > Alternatives considered:
> >
> > One alternative design we could pursue is to extend the existing
> > TableProvider trait to support more sophistication (e.g. predicate
> > pushdown).
> > Adding specialized aggregation operations to the TableProvider trait
> > seems like it would be confusing to most users, who don't have
> > specialized needs to push partial aggregations into the raw scans.
> >
> > Another alternative design we could use on our project is to maintain a
> > fork of the DataFusion code base and simply add our own plan nodes
> > directly to our fork. We would prefer to avoid this, as it will be more
> > expensive to maintain, and we think the user defined API would likely be
> > valuable to others in the community.
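To make the enum-versus-trait distinction in the proposal concrete, here is a minimal, hypothetical Rust sketch. The names (`LogicalPlanNode`, `inputs`, `OrderedFirstLast`, etc.) are illustrative assumptions for this post, not the API in the linked PR; the point is that a trait object tree lets a user crate define its own node while generic passes walk the plan without knowing concrete types:

```rust
use std::fmt;
use std::sync::Arc;

// Today (greatly simplified): every variant must live inside DataFusion.
#[allow(dead_code)]
enum EnumLogicalPlan {
    Projection { input: Box<EnumLogicalPlan> },
    Aggregate { input: Box<EnumLogicalPlan> },
    TableScan { table: String },
}

// Proposed (sketch): any crate can implement the trait.
trait LogicalPlanNode: fmt::Debug {
    /// Human-readable operator name, used when printing plans.
    fn name(&self) -> &str;
    /// Child nodes, so generic passes (e.g. predicate pushdown)
    /// can walk the tree without matching on concrete types.
    fn inputs(&self) -> Vec<Arc<dyn LogicalPlanNode>>;
}

#[derive(Debug)]
#[allow(dead_code)]
struct TableScan {
    table: String,
}

impl LogicalPlanNode for TableScan {
    fn name(&self) -> &str { "TableScan" }
    fn inputs(&self) -> Vec<Arc<dyn LogicalPlanNode>> { vec![] }
}

// A user-defined, domain-specific node defined *outside* DataFusion,
// e.g. a time series first/last aggregate that relies on row order.
#[derive(Debug)]
struct OrderedFirstLast {
    input: Arc<dyn LogicalPlanNode>,
}

impl LogicalPlanNode for OrderedFirstLast {
    fn name(&self) -> &str { "OrderedFirstLast" }
    fn inputs(&self) -> Vec<Arc<dyn LogicalPlanNode>> {
        vec![self.input.clone()]
    }
}

// A generic algorithm over the tree: collect operator names, depth first.
fn operator_names(node: &dyn LogicalPlanNode) -> Vec<String> {
    let mut names = vec![node.name().to_string()];
    for child in node.inputs() {
        names.extend(operator_names(child.as_ref()));
    }
    names
}

fn main() {
    let scan = Arc::new(TableScan { table: "cpu_metrics".to_string() });
    let plan = OrderedFirstLast { input: scan };
    println!("{:?}", operator_names(&plan)); // ["OrderedFirstLast", "TableScan"]
}
```

A real design would also need methods for schema, expressions, and rebuilding a node with new inputs, which is where the isolation downside mentioned above shows up: each node carries its own logic rather than the passes carrying it.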