I would like to propose and request comments from the DataFusion community
on adding user defined LogicalPlanNodes.

A detailed proposal is in this Google Doc
<https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#>
(comments enabled; copy/pasted below for your convenience).

Here is a PR showing how the approach could work:
https://github.com/apache/arrow/pull/8020

Thanks!
Andrew




Proposal for User Defined PlanNode / Operator API

August 21, 2020

Andrew Lamb

This document is a high level proposal / request for comments from the
DataFusion community on adding user defined LogicalPlanNodes.

Problem Statement / Rationale

We are contemplating building a new query engine for a time series
system using DataFusion. To do so, we will likely need domain specific
optimizations which, because of their specialization, are unlikely to be
appropriate for a general purpose engine such as DataFusion.

Examples of the kinds of optimizations we are thinking of:

1. Push down (certain) filter predicates and aggregates into the actual
   scan over specialized storage structures (illustrated below).

2. Specialized time series aggregates, such as first/last, that rely on
   the order of the input rows.
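
As a concrete illustration of example 1 (the table and column names here
are hypothetical), we would like to be able to rewrite a plan such as

    Aggregate: MIN(temp), MAX(temp)
      Filter: time > '2020-08-01'
        TableScan: measurements

into a single domain specific node that evaluates the predicate and the
partial aggregates directly against the storage structures:

    TimeSeriesScan: measurements, filter=[time > '2020-08-01'],
        aggs=[MIN(temp), MAX(temp)]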

Proposed Solution

I propose changing LogicalPlan
<https://github.com/apache/arrow/blob/master/rust/datafusion/src/logicalplan.rs#L773>
nodes from an enum of structs, which must be defined in the DataFusion
source code, to a tree of `dyn LogicalPlanNode` trait objects. Here is a
PR that demonstrates how such an approach could work:
https://github.com/apache/arrow/pull/8020
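
To give a flavor of the idea, here is a minimal sketch of what such a
trait might look like. The method names below are illustrative
assumptions only, not the exact trait from the PR; see the PR for the
actual interface:

    use std::any::Any;
    use std::fmt::Debug;

    use arrow::datatypes::SchemaRef;

    /// Illustrative sketch only -- these method names are
    /// hypothetical, not the exact trait from the PR above.
    pub trait LogicalPlanNode: Debug {
        /// Return self as Any so that optimizer passes can downcast
        /// to the concrete node types they know how to handle.
        fn as_any(&self) -> &dyn Any;

        /// The child (input) nodes of this node in the plan tree;
        /// leaf nodes such as scans return an empty Vec.
        fn inputs(&self) -> Vec<&dyn LogicalPlanNode>;

        /// The schema of the rows this node produces.
        fn schema(&self) -> SchemaRef;
    }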

The primary benefit of such a design over the existing enum of structs is
that users of the DataFusion library can write their own user defined
LogicalPlan nodes and still take advantage of general purpose logic such as
predicate push down.
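
For instance, a time series project could then define a node like the
following entirely outside the DataFusion source tree (again a
hypothetical sketch, building on the trait sketched above):

    /// A hypothetical user defined node: a scan of a time series
    /// table with a filter predicate pushed into the scan itself.
    #[derive(Debug)]
    struct TimeSeriesScan {
        table_name: String,
        schema: SchemaRef,
        // pushed down predicates, partial aggregates, etc.
    }

    impl LogicalPlanNode for TimeSeriesScan {
        fn as_any(&self) -> &dyn Any {
            self
        }

        fn inputs(&self) -> Vec<&dyn LogicalPlanNode> {
            vec![] // a scan is a leaf; it has no input nodes
        }

        fn schema(&self) -> SchemaRef {
            self.schema.clone()
        }
    }

A general purpose pass such as predicate pushdown can walk the tree via
inputs() without knowing about TimeSeriesScan, while a domain specific
pass can downcast via as_any() to rewrite it.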

A downside of this design is that it isolates the treatment of each type
of LogicalPlanNode into that node's own module. This means that
algorithms over LogicalPlan nodes (e.g. predicate pushdown) will no
longer contain any node type specific logic; the behavior of such a pass
would be spread across the individual node implementations rather than
being visible in one place, which could make it harder to reason about.

Prior Work

I am not familiar enough with the Spark codebase to really understand
how/if Spark allows this, but I believe the high level idea is that the
catalyst
<https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst>
library provides a similar interface, called LogicalPlan
<https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L29>,
that defines the operators available in Spark SQL.

Alternatives Considered

One alternative design we could pursue is to extend the existing
TableProvider trait to support more sophistication (e.g. predicate
pushdown). However, adding specialized aggregation operations to the
TableProvider trait seems like it would be confusing to the majority of
users, who do not have specialized needs to push partial aggregations
into the raw scans.
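
For concreteness, that alternative would look roughly like the
following. This is a hypothetical signature, not the actual DataFusion
trait; the `filters` and `aggregates` parameters are the assumed
extension:

    use arrow::datatypes::SchemaRef;
    use arrow::record_batch::RecordBatch;
    use datafusion::error::Result;
    use datafusion::logicalplan::Expr;

    /// Hypothetical extension of TableProvider, for illustration
    /// only.
    pub trait TableProvider {
        fn schema(&self) -> SchemaRef;

        /// Scan the table. `filters` and `aggregates` are the
        /// proposed additions: expressions the provider may choose
        /// to evaluate directly against its own storage.
        fn scan(
            &self,
            projection: &Option<Vec<usize>>,
            filters: &[Expr],
            aggregates: &[Expr],
            batch_size: usize,
        ) -> Result<Vec<RecordBatch>>;
    }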

Another alternative design, which we could use on our project, is to
maintain a fork of the DataFusion code base and simply add our own plan
nodes directly to that fork. We would prefer to avoid this, as it would
be more expensive to maintain, and we think a user defined node API
would likely be valuable to others in the community.
