I would like to propose and request comments from the DataFusion community on adding user defined LogicalPlanNodes.
A detailed proposal is in this google doc <https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#> (comments enabled, copy/pasted below for your convenience). Here is a PR showing how the approach could work: https://github.com/apache/arrow/pull/8020 <https://github.com/apache/arrow/pull/8020#pullrequestreview-472829313>

Thanks!
Andrew

Proposal for User Defined PlanNode / Operator API
August 21, 2020
Andrew Lamb

This document is a high level proposal / request for comments from the DataFusion community on adding user defined LogicalPlanNodes.

Problem Statement / Rationale

We are contemplating building a new query engine for time series data using DataFusion. To do so, we will likely need domain specific optimizations which are unlikely to be appropriate for a general purpose engine such as DataFusion because of their specialization. Examples of the kinds of optimizations we are thinking of:

1. Push down (certain) filter predicates and aggregates into the actual scan over specialized storage structures.
2. Specialized time series aggregates that rely on the order of the input rows, such as first/last.

Proposed Solution

I propose changing LogicalPlan <https://github.com/apache/arrow/blob/master/rust/datafusion/src/logicalplan.rs#L773> nodes from an enum of structs, which must be defined in the DataFusion source code, to a tree of `dyn LogicalPlanNode` trait objects. Here is a PR that demonstrates how such an approach could work: https://github.com/apache/arrow/pull/8020

The primary benefit of this design over the existing enum of structs is that users of the DataFusion library can write their own user defined LogicalPlan nodes and still take advantage of general purpose logic such as predicate push down. A downside of this design is that it isolates the logic for the treatment of each type of LogicalPlanNode into that plan node's own module. This means that algorithms over LogicalPlan nodes (e.g.
predicate pushdown) will no longer have any node type specific logic in them, which could make them harder to reason about.

Prior Work

I am not familiar enough with the Spark codebase to really understand how/if Spark allows this, but I think the high level idea is that the catalyst <https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst> library provides a similar interface called LogicalPlan <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L29> that defines the operators that are available in Spark SQL.

Alternates Considered

One alternate design we could pursue is to extend the existing TableProvider trait to support more sophistication (e.g. predicate pushdown). However, adding specialized aggregation operations to the TableProvider trait seems like it would be confusing to most users, who do not have specialized needs to push partial aggregations into the raw scans.

Another alternate design we could use on our project is to maintain a fork of the DataFusion code base and simply add our own plan nodes directly to our fork. We would prefer to avoid this, as it would be more expensive to maintain, and we think a user defined API would likely be valuable to others in the community.
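To make the shape of the proposal concrete, here is a rough sketch of what a trait-object based plan tree could look like. This is illustrative only: the trait name, its methods, and the example nodes (`TableScan`, `FirstLastAggregate`, `operator_names`) are invented for this sketch and do not reflect the actual API in the PR.

```rust
use std::fmt;

/// Sketch only: every plan node, built-in or user defined, implements
/// this trait instead of being a variant of a closed enum.
trait LogicalPlanNode: fmt::Debug {
    /// Human-readable operator name, used when formatting the plan.
    fn name(&self) -> &str;
    /// Child nodes; generic passes recurse through these without
    /// knowing the concrete node type.
    fn children(&self) -> Vec<&dyn LogicalPlanNode>;
}

/// A built-in table scan (leaf node).
#[derive(Debug)]
struct TableScan {
    table: String,
}

impl LogicalPlanNode for TableScan {
    fn name(&self) -> &str { "TableScan" }
    fn children(&self) -> Vec<&dyn LogicalPlanNode> { vec![] }
}

/// A user defined, time series specific aggregate that a general
/// purpose engine would not ship with.
#[derive(Debug)]
struct FirstLastAggregate {
    input: Box<dyn LogicalPlanNode>,
}

impl LogicalPlanNode for FirstLastAggregate {
    fn name(&self) -> &str { "FirstLastAggregate" }
    fn children(&self) -> Vec<&dyn LogicalPlanNode> {
        vec![self.input.as_ref()]
    }
}

/// A generic pass that works on any node via the trait: here it just
/// collects operator names in pre-order, but an optimizer rule such as
/// predicate pushdown would take the same shape.
fn operator_names(node: &dyn LogicalPlanNode) -> Vec<String> {
    let mut names = vec![node.name().to_string()];
    for child in node.children() {
        names.extend(operator_names(child));
    }
    names
}

fn main() {
    let plan = FirstLastAggregate {
        input: Box::new(TableScan { table: "cpu_metrics".to_string() }),
    };
    // The generic pass traverses the user defined node without any
    // special-casing in the pass itself.
    println!("{:?}", operator_names(&plan));
}
```

The point of the sketch is the last function: a pass written once against the trait handles nodes DataFusion has never heard of, which is exactly what lets user defined nodes participate in general purpose logic like predicate pushdown.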
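For comparison, the TableProvider alternate could be sketched roughly as follows. The trait, its pushdown hook, and the example store are hypothetical stand-ins for this sketch, not DataFusion's actual TableProvider API; rows are modeled as plain `i64` values rather than record batches.

```rust
/// Sketch only: a stand-in provider trait extended with an optional
/// predicate pushdown hook. Not DataFusion's real TableProvider.
trait TableProvider {
    /// Scan the table, optionally applying a filter during the scan.
    fn scan(&self, filter: Option<&dyn Fn(i64) -> bool>) -> Vec<i64>;

    /// Providers that can evaluate filters inside their own storage
    /// report it here, so a planner could skip adding a Filter node.
    fn supports_filter_pushdown(&self) -> bool {
        false
    }
}

/// A specialized time series store that filters while scanning.
struct TimeSeriesStore {
    rows: Vec<i64>,
}

impl TableProvider for TimeSeriesStore {
    fn scan(&self, filter: Option<&dyn Fn(i64) -> bool>) -> Vec<i64> {
        match filter {
            // Predicate evaluated during the scan itself.
            Some(pred) => self.rows.iter().copied().filter(|v| pred(*v)).collect(),
            None => self.rows.clone(),
        }
    }

    fn supports_filter_pushdown(&self) -> bool {
        true
    }
}

fn main() {
    let store = TimeSeriesStore { rows: vec![1, 5, 10, 20] };
    let ge_10 = |v: i64| v >= 10;
    // Filter applied inside the scan, with no separate Filter operator.
    println!("{:?}", store.scan(Some(&ge_10)));
}
```

This works well for filter pushdown, but it also illustrates the objection above: every further specialization (partial aggregates, ordering guarantees) would have to be bolted onto the same provider trait, which most implementors would never use.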