Exciting to see! This is exactly the kind of interop we've been working diligently toward since the start of the project.
On Tue, Oct 20, 2020 at 11:54 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Really cool work. Very nice to see this type of integration!
>
> On Tue, Oct 20, 2020 at 9:35 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > Over the past few weeks I have been running an experiment whose main
> > goal is to run a query in (Rust's) DataFusion and use Python within it,
> > so that we can embed Python's ecosystem in the query (à la pyspark)
> > (details here <https://github.com/jorgecarleitao/datafusion-python>).
> >
> > I am super happy to say that with the code on PR 8401
> > <https://github.com/apache/arrow/pull/8401>, I was able to achieve its
> > main goal: a DataFusion query that uses a Python UDF, using the C data
> > interface to perform zero-copy transfers between C++/Python and Rust.
> > The Python UDFs' signatures are dead simple: f(pa.Array, ...) -> pa.Array.
> >
> > This is a pure Arrow implementation (no Python dependencies other than
> > pyarrow) and it achieves near-optimal execution performance under the
> > constraints, with the full flexibility of Python and its ecosystem to
> > perform non-trivial transformations. Folks can of course convert pyarrow
> > Arrays to other Python formats within the UDF (e.g. Pandas or numpy),
> > and also use them for interpolation, ML, CUDA, etc.
> >
> > Finally, the coolest thing about this is that a single execution passes
> > through most of the Rust code base (DataFusion, Arrow, Parquet), but
> > also through a lot of the Python, C, and C++ code bases. IMO this
> > reaffirms the success of this project, in which the different parts
> > stick to the contract (spec, C data interface) and enable stuff like
> > this. So, big kudos to all of you!
> >
> > Best,
> > Jorge