Cool -- thanks Jorge and Andy. I'll start working on an actual proposed PR to merge. I really like Jorge's idea of trying to port the Aggregate operator to use a LogicalPlan Node style and see how easy that would be.
Andrew

On Sat, Aug 22, 2020 at 12:01 PM Andy Grove <andygrov...@gmail.com> wrote:

> I've added some comments as well. I fully support converting LogicalPlan
> from enum to trait. You will notice that I implemented PhysicalPlan using
> traits, because I had more Rust experience at the time I did this work.
>
> There are going to be some design challenges to make this change, I'm sure,
> but I think we can work through them.
>
> Thanks,
>
> Andy.
>
> On Fri, Aug 21, 2020 at 3:37 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> > I would like to propose and request comments from the DataFusion
> > community on adding user defined LogicalPlanNodes.
> >
> > A detailed proposal is in this google doc
> > <https://docs.google.com/document/d/1IHCGkCuUvnE9BavkykPULn6Ugxgqc1JShT4nz1vMi7g/edit#>
> > (comments enabled, copy/pasted below for your convenience).
> >
> > Here is a PR showing how the approach could work:
> > https://github.com/apache/arrow/pull/8020
> > <https://github.com/apache/arrow/pull/8020#pullrequestreview-472829313>
> >
> > Thanks!
> > Andrew
> >
> >
> > Proposal for User Defined PlanNode / Operator API
> >
> > August 21, 2020
> >
> > Andrew Lamb
> >
> > This document is a high level proposal / request for comments from the
> > DataFusion community on adding user defined LogicalPlanNodes.
> >
> > Problem Statement / Rationale
> >
> > We are contemplating building a new time series related query engine
> > using DataFusion. To do so, we will likely need domain specific
> > optimizations which are unlikely to be appropriate for a general purpose
> > engine such as DataFusion because of their specialization.
> >
> > Examples of the kinds of optimizations we are thinking of:
> >
> > 1. Push down (certain) filter predicates and aggregates into the actual
> >    scan over specialized storage structures.
> >
> > 2.
> >    Specialized time series specific aggregates that rely on the order
> >    of the input rows, such as first/last.
> >
> > Proposed Solution
> >
> > I propose changing LogicalPlan
> > <https://github.com/apache/arrow/blob/master/rust/datafusion/src/logicalplan.rs#L773>
> > nodes from an enum of structs, which must be defined in the DataFusion
> > source code, to a tree of `dyn LogicalPlanNode` trait objects. Here is a
> > PR that demonstrates how such an approach could work:
> > https://github.com/apache/arrow/pull/8020
> >
> > The primary benefit of such a design over the existing enum of structs
> > is that users of the DataFusion library can write their own user defined
> > LogicalPlan nodes and still take advantage of general purpose logic such
> > as predicate push down.
> >
> > A downside of this design is that it will isolate the logic for the
> > treatment of each type of LogicalPlanNode into its own module for that
> > plan node. This means that algorithms over LogicalPlan nodes (e.g.
> > predicate pushdown) will no longer have any node type specific logic in
> > them, which could make them harder to reason about.
> >
> > Prior Work:
> >
> > I am not familiar enough with the Spark codebase to really understand
> > how/if Spark allows this, but I think the high level idea is that the
> > catalyst
> > <https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst>
> > library provides a similar interface called LogicalPlan
> > <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L29>
> > that defines the operators that are available in Spark SQL.
> >
> > Alternatives considered:
> >
> > One alternative design we could pursue is to extend the existing
> > TableProvider trait to support more sophistication (e.g. predicate
> > pushdown).
> > Adding specialized aggregation operations to the TableProvider trait
> > seems like it would be confusing to most users, who don't have
> > specialized needs to push partial aggregations into the raw scans.
> >
> > Another alternative design we could use on our project is to maintain a
> > fork of the DataFusion code base and simply add our own plan nodes
> > directly to our fork. We would prefer to avoid this, as it will be more
> > expensive to maintain, and we think the user defined API would likely be
> > valuable to others in the community.
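To make the enum-versus-trait distinction in the proposal concrete, here is a minimal, hypothetical Rust sketch. The names (`LogicalPlanNode`, `inputs`, `OrderedFirstLast`, etc.) are illustrative assumptions for this post, not the API in the linked PR; the point is that a trait object tree lets a user crate define its own node while generic passes walk the plan without knowing concrete types:

```rust
use std::fmt;
use std::sync::Arc;

// Today (greatly simplified): every variant must live inside DataFusion.
#[allow(dead_code)]
enum EnumLogicalPlan {
    Projection { input: Box<EnumLogicalPlan> },
    Aggregate { input: Box<EnumLogicalPlan> },
    TableScan { table: String },
}

// Proposed (sketch): any crate can implement the trait.
trait LogicalPlanNode: fmt::Debug {
    /// Human-readable operator name, used when printing plans.
    fn name(&self) -> &str;
    /// Child nodes, so generic passes (e.g. predicate pushdown)
    /// can walk the tree without matching on concrete types.
    fn inputs(&self) -> Vec<Arc<dyn LogicalPlanNode>>;
}

#[derive(Debug)]
#[allow(dead_code)]
struct TableScan {
    table: String,
}

impl LogicalPlanNode for TableScan {
    fn name(&self) -> &str { "TableScan" }
    fn inputs(&self) -> Vec<Arc<dyn LogicalPlanNode>> { vec![] }
}

// A user-defined, domain-specific node defined *outside* DataFusion,
// e.g. a time series first/last aggregate that relies on row order.
#[derive(Debug)]
struct OrderedFirstLast {
    input: Arc<dyn LogicalPlanNode>,
}

impl LogicalPlanNode for OrderedFirstLast {
    fn name(&self) -> &str { "OrderedFirstLast" }
    fn inputs(&self) -> Vec<Arc<dyn LogicalPlanNode>> {
        vec![self.input.clone()]
    }
}

// A generic algorithm over the tree: collect operator names, depth first.
fn operator_names(node: &dyn LogicalPlanNode) -> Vec<String> {
    let mut names = vec![node.name().to_string()];
    for child in node.inputs() {
        names.extend(operator_names(child.as_ref()));
    }
    names
}

fn main() {
    let scan = Arc::new(TableScan { table: "cpu_metrics".to_string() });
    let plan = OrderedFirstLast { input: scan };
    println!("{:?}", operator_names(&plan)); // ["OrderedFirstLast", "TableScan"]
}
```

A real design would also need methods for schema, expressions, and rebuilding a node with new inputs, which is where the isolation downside mentioned above shows up: each node carries its own logic rather than the passes carrying it.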