Really cool work.  Very nice to see this type of integration!

On Tue, Oct 20, 2020 at 9:35 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> Over the past few weeks I have been running an experiment whose main goal
> is to run a query in (Rust's) DataFusion and call Python from it, so that
> we can embed Python's ecosystem in the query (à la PySpark) (details here
> <https://github.com/jorgecarleitao/datafusion-python>).
>
> I am super happy to say that with the code in PR 8401
> <https://github.com/apache/arrow/pull/8401>, I was able to achieve its
> main goal: a DataFusion query that calls a Python UDF, using the C data
> interface to pass data zero-copy between C++/Python and Rust. The Python
> UDFs' signatures are dead simple: f(pa.Array, ...) -> pa.Array.
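>
> For illustration, a minimal sketch of a UDF with that signature might look
> like this (the function name and the use of pyarrow.compute are just
> examples, not the datafusion-python API itself):
>
>     import pyarrow as pa
>     import pyarrow.compute as pc
>
>     # A UDF takes one or more pa.Array arguments and returns a pa.Array.
>     def add_one(array: pa.Array) -> pa.Array:
>         # The input arrives via the C data interface, so no copy is needed here.
>         return pc.add(array, 1)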
>
> This is a pure Arrow implementation (no Python dependencies other than
> pyarrow) and achieves near-optimal execution performance under the
> constraints, with the full flexibility of Python and its ecosystem to
> perform non-trivial transformations. Folks can of course convert pyarrow
> Arrays to other Python formats within the UDF (e.g. pandas or numpy), and
> use them for interpolation, ML, CUDA, etc.
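>
> As a sketch of that kind of conversion (the function here is illustrative,
> not taken from the repository):
>
>     import pyarrow as pa
>
>     # Drop into numpy inside the UDF and return to Arrow at the end.
>     def double(array: pa.Array) -> pa.Array:
>         # Zero-copy for primitive arrays without nulls; pyarrow copies otherwise.
>         values = array.to_numpy(zero_copy_only=False)
>         return pa.array(values * 2.0)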
>
> Finally, the coolest thing about this is that a single execution passes
> through most of the Rust code base (DataFusion, Arrow, Parquet), but also
> through a lot of the Python, C, and C++ code base. IMO this reaffirms the
> success of this project, in which the different parts stick to the
> contract (the spec, the C data interface) and enable things like this. So,
> big kudos to all of you!
>
> Best,
> Jorge
>
