I've added some comments as well. I fully support converting LogicalPlan from enum to trait. You will notice that I implemented PhysicalPlan using traits, because I had more Rust experience at the time I did this work.
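To make the idea concrete, here is a rough sketch of the shape of the change. This is illustrative only — it is not the code from Andrew's PR, and Expr/Schema are placeholder types standing in for the real ones:

use std::fmt;
use std::sync::Arc;

// Placeholder types standing in for DataFusion's real Expr and Schema.
#[derive(Debug)]
pub struct Expr;
#[derive(Debug)]
pub struct Schema;

// Today: every plan node is a variant of a single enum defined inside
// DataFusion, so adding a node means editing DataFusion itself
// (heavily simplified; the real enum has many more variants).
#[derive(Debug)]
pub enum LogicalPlan {
    Projection { exprs: Vec<Expr>, input: Box<LogicalPlan> },
    Filter { predicate: Expr, input: Box<LogicalPlan> },
}

// Proposed: a tree of trait objects. Any crate can implement the trait,
// so user defined nodes flow through generic optimizer passes such as
// predicate pushdown.
pub trait LogicalPlanNode: fmt::Debug {
    /// The schema of the rows this node produces.
    fn schema(&self) -> &Schema;
    /// Child nodes, so generic passes can walk the tree.
    fn inputs(&self) -> Vec<Arc<dyn LogicalPlanNode>>;
}

// A user defined node living entirely outside DataFusion, e.g. a time
// series scan that has already absorbed a pushed-down predicate.
#[derive(Debug)]
pub struct TimeSeriesScan {
    schema: Schema,
    pushed_predicate: Option<Expr>,
}

impl LogicalPlanNode for TimeSeriesScan {
    fn schema(&self) -> &Schema {
        &self.schema
    }
    fn inputs(&self) -> Vec<Arc<dyn LogicalPlanNode>> {
        vec![] // a scan is a leaf node
    }
}

The key point is that something like TimeSeriesScan can live in the user's crate rather than in DataFusion.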
There are going to be some design challenges to make this change, I'm sure, but I think we can work through them.

Thanks,

Andy

On Fri, Aug 21, 2020 at 3:37 PM Andrew Lamb <al...@influxdata.com> wrote:

> I would like to propose and request comments from the DataFusion community
> on adding user defined LogicalPlanNodes.
>
> A detailed proposal is in this google doc
> <https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#>
> (comments enabled, copy/pasted below for your convenience).
>
> Here is a PR showing how the approach could work:
> https://github.com/apache/arrow/pull/8020
> <https://github.com/apache/arrow/pull/8020#pullrequestreview-472829313>
>
> Thanks!
> Andrew
>
>
> Proposal for User Defined PlanNode / Operator API
>
> August 21, 2020
>
> Andrew Lamb
>
> This document is a high level proposal / request for comments from the
> DataFusion community on adding user defined LogicalPlanNodes.
>
> Problem Statement / Rationale
>
> We are contemplating building a new query engine for a time series system
> using DataFusion. To do so, we will likely need domain specific
> optimizations which are unlikely to be appropriate for a general purpose
> engine such as DataFusion because of their specialization.
>
> Examples of the kinds of optimizations we are thinking of:
>
> 1. Push down (certain) filter predicates and aggregates into the actual
>    scan over specialized storage structures.
> 2. Specialized time series aggregates that rely on the order of the
>    input rows, such as first/last.
>
> Proposed Solution
>
> I propose changing LogicalPlan
> <https://github.com/apache/arrow/blob/master/rust/datafusion/src/logicalplan.rs#L773>
> nodes from an enum of structs, which must be defined in the DataFusion
> source code, to a tree of `dyn LogicalPlanNode` trait objects. Here is a
> PR that demonstrates how such an approach could work:
> https://github.com/apache/arrow/pull/8020
>
> The primary benefit of such a design over the existing enum of structs is
> that users of the DataFusion library can write their own user defined
> LogicalPlan nodes and still take advantage of general purpose logic such
> as predicate push down.
>
> A downside of this design is that it isolates the logic for the treatment
> of each type of LogicalPlanNode into that plan node's own module. This
> means that algorithms over LogicalPlan nodes (e.g. predicate pushdown)
> will no longer have any node type specific logic in them, which could
> make them harder to reason about.
>
> Prior Work
>
> I am not familiar enough with the Spark codebase to really understand
> how/if Spark allows this, but I think the high level idea is that the
> catalyst
> <https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst>
> library provides a similar interface called LogicalPlan
> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L29>
> that defines the operators that are available in Spark SQL.
>
> Alternatives Considered
>
> One alternative design we could pursue is to extend the existing
> TableProvider trait to support more sophistication (e.g. predicate
> pushdown).
> Adding specialized aggregation operations to the TableProvider trait seems
> like it would be confusing to most users, who don't have specialized needs
> to push partial aggregations into the raw scans.
>
> Another alternative design that we could use on our project is to maintain
> a fork of the DataFusion code base and simply add our own plan nodes
> directly to our fork. We would prefer to avoid this, as it will be more
> expensive to maintain and we think the user defined API would likely be
> valuable to others in the community.
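Coming back to the alternatives section above: for what it's worth, here is a rough sketch of what an extended TableProvider might look like. These are hypothetical signatures, not DataFusion's actual trait, but I think they illustrate why pushing aggregation into it would feel out of place:

// Placeholder types standing in for the real Expr and RecordBatch.
pub struct Expr;
pub struct RecordBatch;

// A hypothetical, extended TableProvider. This is NOT DataFusion's actual
// trait; it is a sketch of the alternative design being discussed.
pub trait TableProvider {
    // The scan is offered the filter predicates so specialized storage can
    // apply them; predicates it cannot handle would simply be re-applied
    // by the engine after the scan.
    fn scan(
        &self,
        projection: &Option<Vec<usize>>,
        filters: &[Expr],
    ) -> Vec<RecordBatch>;

    // The awkward part: supporting partial aggregation here would force
    // every table implementer to confront aggregation concepts, even though
    // most tables have no specialized support for them. Something like:
    // fn aggregated_scan(&self, group_by: &[Expr], aggregates: &[Expr])
    //     -> Vec<RecordBatch>;
}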