Very promising, congratulations :-)


On 20/10/2020 at 18:35, Jorge Cardoso Leitão wrote:
> Hi,
> 
> Over the past few weeks I have been running an experiment whose main goal
> is to run a query in (Rust's) DataFusion and call Python from it, so that
> we can embed Python's ecosystem in the query (à la PySpark) (details here
> <https://github.com/jorgecarleitao/datafusion-python>).
> 
> I am super happy to say that with the code in PR 8401
> <https://github.com/apache/arrow/pull/8401>, I was able to achieve that
> main goal: a DataFusion query that uses a Python UDF, relying on the C data
> interface to perform zero-copy exchanges between C++/Python and Rust. The
> Python UDF's signature is dead simple: f(pa.Array, ...) -> pa.Array.
> 
> This is a pure Arrow implementation (no Python dependencies other than
> pyarrow) and achieves near-optimal execution performance under these
> constraints, with the full flexibility of Python and its ecosystem to
> perform non-trivial transformations. Folks can of course convert PyArrow
> Arrays to other Python formats within the UDF (e.g. Pandas or numpy), and
> also use them for interpolation, ML, CUDA, etc.
> 
> Finally, the coolest thing about this is that a single execution passes
> through most of the Rust code base (DataFusion, Arrow, Parquet), but also
> through a lot of the Python, C and C++ code bases. IMO this reaffirms the
> success of this project, in which the different parts stick to the contract
> (the spec and the C data interface) and enable things like this. So, big
> kudos to all of you!
> 
> Best,
> Jorge
> 
