Hi Micah,

All testing is actually done from Python: create a record batch in pyarrow,
push it to datafusion, consume it back in Python, and compare the result
using pyarrow's equality. Sometimes parquet is used instead.
The library is tested against pyarrow==1 from PyPI: we can bump that, but
if it works with pyarrow==1, chances are things will only improve with
higher versions :)

Releases: I was thinking of releasing it as a separate wheel, for two reasons:

* not forcing people who want pyarrow to download datafusion binaries with it
* having versioning that is independent from pyarrow

and "bracket" the pyarrow versions that we ensure compatibility with.
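
To make the "bracket" idea concrete, here is a rough sketch of a version
range check. The bounds in `SUPPORTED` are hypothetical (only pyarrow==1 is
actually tested today), and `parse_version` / `in_bracket` are illustrative
helpers that handle plain numeric versions only; in practice the bracket
would more likely be declared in the wheel's dependency metadata.

```python
def parse_version(text):
    """Turn '1.0.1' into (1, 0, 1), padding short versions like '1' with zeros.

    Only plain numeric versions are handled; suffixes such as 'rc1' are not.
    """
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def in_bracket(version, lower, upper):
    """True when lower <= version < upper (upper bound is exclusive)."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Hypothetical bracket, chosen for illustration only.
SUPPORTED = ("1.0.0", "4.0.0")

print(in_bracket("1.0.1", *SUPPORTED))
print(in_bracket("4.0.0", *SUPPORTED))
```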

An alternative is to release with the same versioning as datafusion,
like Arrow C++ / pyarrow and Spark / PySpark.
The upside is that the versions are aligned. The downside is that we would
be releasing a lot of majors for no reason: so far, none of the backward
incompatible changes in datafusion have been backward incompatible in
python-datafusion. It is easier to break backward compatibility in a Rust
library than it is in a Python wrapper around a Rust library.

What are your thoughts, Micah?

Best,
Jorge





On Sun, Apr 25, 2021 at 10:32 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Jorge,
> I think this would certainly be a valuable contribution.  How were you
> thinking of hosting (which repo)/publishing it (maintaining a separate
> wheel)?  Also did you have thoughts on integration testing with pyarrow?
>
> Cheers,
> Micah
>
> On Sun, Apr 25, 2021 at 9:13 AM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > I filed a PR [1] to open up a discussion to incorporate python-datafusion
> > [2] into the Apache Arrow project.
> >
> > Python-datafusion is a Python library [3] built on top of DataFusion that
> > enables people to use DataFusion from Python. It leverages the C data
> > interface for zero-cost copy between DataFusion and pyarrow (a bunch of
> > pointers is shared around).
> >
> > For example, it allows users to read a CSV from Rust, pass the arrays to a
> > C++ kernel, continue the computation in Rust's kernels, and export to
> > parquet using Rust (or C++ parquet, or whatever ^_^). It supports UDFs and
> > UDAFs, in case someone wants to go crazy with Pyarrow, Pandas, numpy or
> > tensorflow. =)
> >
> > Best,
> > Jorge
> >
> > [1] https://github.com/apache/arrow-datafusion/pull/69
> > [2] https://github.com/jorgecarleitao/datafusion-python
> > [3] https://pypi.org/project/datafusion/
> >
>
