Hi Jorge, This all sounds good to me. It might be nice to test against both the pinned released version of pyarrow and at head if possible.
I like the idea of not causing release churn as long as all the underlying libraries are compatible.

Thanks for the write up.

-Micah

On Mon, Apr 26, 2021 at 10:30 AM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:

> Hi Micah,
>
> All testing is actually done from Python: create a record batch in
> pyarrow, push it to datafusion, consume it back in Python, and compare
> the result using pyarrow's equality. Sometimes parquet is used instead.
> The library is tested against pyarrow==1 from pypi: we can bump that, but
> if it works in pyarrow==1, chances are things will improve with higher
> versions :)
>
> Releases: I thought to have it released as a separate wheel for two
> reasons:
>
> * not force people that want pyarrow to download datafusion binaries with
>   it
> * have independent versioning from pyarrow
>
> and "bracket" the pyarrow that we ensure compatibility with.
>
> Another alternative is to release with the same versioning as datafusion,
> like arrow c++ / pyarrow and spark / pyspark.
> The upside is that the versions are aligned. The downside is that we will
> be releasing a lot of majors for no reason: so far, none of the
> backward-incompatible changes in datafusion has been backward incompatible
> in python-datafusion: it is easier to break backward compat. in a Rust
> library than it is in a Python wrapper to a Rust library.
>
> What are your thoughts, Micah?
>
> Best,
> Jorge
>
> On Sun, Apr 25, 2021 at 10:32 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Jorge,
>> I think this would certainly be a valuable contribution. How were you
>> thinking of hosting (which repo)/publishing it (maintaining a separate
>> wheel)? Also, did you have thoughts on integration testing with pyarrow?
>>
>> Cheers,
>> Micah
>>
>> On Sun, Apr 25, 2021 at 9:13 AM Jorge Cardoso Leitão
>> <jorgecarlei...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I filed a PR [1] to open up a discussion to incorporate
>> > python-datafusion [2] into the Apache Arrow project.
>> >
>> > Python-datafusion is a Python library [3] built on top of DataFusion
>> > that enables people to use DataFusion from Python. It leverages the C
>> > data interface for zero-cost copy between DataFusion and pyarrow (a
>> > bunch of pointers is shared around).
>> >
>> > For example, it allows users to read a CSV from Rust, pass the arrays
>> > to a C++ kernel, continue the computation in Rust's kernels, and export
>> > to parquet using Rust (or C++ parquet, or whatever ^_^). It supports
>> > UDFs and UDAFs, in case someone wants to go crazy with Pyarrow, Pandas,
>> > numpy or tensorflow. =)
>> >
>> > Best,
>> > Jorge
>> >
>> > [1] https://github.com/apache/arrow-datafusion/pull/69
>> > [2] https://github.com/jorgecarleitao/datafusion-python
>> > [3] https://pypi.org/project/datafusion/
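
A rough sketch of what "bracketing" the supported pyarrow versions could look like as a check, in case it helps the discussion. This is only an illustration under assumptions: the helper names release_tuple and in_bracket and the bounds used in the usage note are hypothetical and not taken from python-datafusion, and development builds (for example a build from head) are treated as equal to their base release, which is a simplification.

```python
import re


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release components, padded to three parts.

    '1' -> (1, 0, 0); '5.0.0.dev123' -> (5, 0, 0).
    Pre-release and dev suffixes are ignored (a simplification).
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"not a version string: {version!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    # Pad short versions with zeros so that '1.0' and '1.0.0' compare equal.
    return parts + (0,) * (3 - len(parts))


def in_bracket(version: str, lower: str, upper: str) -> bool:
    """True if lower <= version < upper on the numeric release components."""
    return release_tuple(lower) <= release_tuple(version) < release_tuple(upper)
```

With a hypothetical bracket of "1" up to but excluding "2", in_bracket("1.0.1", "1", "2") holds while in_bracket("2.0.0", "1", "2") does not. A test matrix could run the pinned release and a head build through the same check before running the round-trip tests.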