Exciting to see! This is exactly the kind of interop we've been working diligently toward since the start of the project.
On Tue, Oct 20, 2020 at 11:54 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Really cool work. Very nice to see this type of integration!
>
> On Tue, Oct 20, 2020 at 9:35 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > Over the past few weeks I have been running an experiment whose main
> > goal is to run a query in (Rust's) DataFusion and use Python within it,
> > so that we can embed Python's ecosystem in the query (à la pyspark)
> > (details here <https://github.com/jorgecarleitao/datafusion-python>).
> >
> > I am super happy to say that with the code on PR 8401
> > <https://github.com/apache/arrow/pull/8401>, I was able to achieve its
> > main goal: a DataFusion query that uses a Python UDF, using the C data
> > interface to perform zero-copy transfers between C++/Python and Rust.
> > The Python UDFs' signatures are dead simple: f(pa.Array, ...) -> pa.Array.
> >
> > This is a pure Arrow implementation (no Python dependencies other than
> > pyarrow) and it achieves near-optimal execution performance under the
> > constraints, with the full flexibility of Python and its ecosystem to
> > perform non-trivial transformations. Folks can of course convert pyarrow
> > Arrays to other Python formats within the UDF (e.g. Pandas or numpy),
> > and also use them for interpolation, ML, CUDA, etc.
> >
> > Finally, the coolest thing about this is that a single execution passes
> > through most of the Rust code base (DataFusion, Arrow, Parquet), but
> > also through a lot of the Python, C, and C++ code bases. IMO this
> > reaffirms the success of this project, in which the different parts
> > stick to the contract (spec, C data interface) and enable stuff like
> > this. So, big kudos to all of you!
> >
> > Best,
> > Jorge