Very promising, congratulations :-)
On 20/10/2020 at 18:35, Jorge Cardoso Leitão wrote:
> Hi,
>
> Over the past few weeks I have been running an experiment whose main goal
> is to run a query in (Rust's) DataFusion and use Python on it, so that we
> can embed Python's ecosystem in the query (à la pyspark) (details here
> <https://github.com/jorgecarleitao/datafusion-python>).
>
> I am super happy to say that with the code in PR 8401
> <https://github.com/apache/arrow/pull/8401>, I was able to achieve its main
> goal: a DataFusion query that uses a Python UDF, using the C data
> interface to perform zero-copy transfers between C++/Python and Rust. The
> Python UDFs' signatures are dead simple: f(pa.Array, ...) -> pa.Array.
>
> This is a pure Arrow implementation (no Python dependencies other than
> pyarrow) and achieves near-optimal execution performance under the
> constraints, with the full flexibility of Python and its ecosystem to
> perform non-trivial transformations. Folks can of course convert pyarrow
> Arrays to other Python formats within the UDF (e.g. Pandas or numpy), but
> also use them for interpolation, ML, CUDA, etc.
>
> Finally, the coolest thing about this is that a single execution passes
> through most of the Rust code base (DataFusion, Arrow, Parquet), but also
> through a lot of the Python, C, and C++ code base. IMO this reaffirms the
> success of this project, in which the different parts stick to the contract
> (spec, C data interface) and enable stuff like this. So, big kudos to all
> of you!
>
> Best,
> Jorge