Hi Jorge,
This all sounds good to me.  It might be nice to test against both the
pinned released version of pyarrow and at head if possible.

I like the idea of not causing release churn as long as all the underlying
libraries are compatible.

Thanks for the write up.

-Micah

On Mon, Apr 26, 2021 at 10:30 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi Micah,
>
> All testing is actually done from Python: create a record batch in
> pyarrow, push it to datafusion,
> consume it back in Python, and compare the result using pyarrow's
> equality. Sometimes parquet is used instead.
> The library is tested against pyarrow==1 from pypi: we can bump that, but
> if it works in pyarrow==1,
> chances are things will improve with higher versions :)
>
> Releases: I thought to have it released as a separate wheel for two
> reasons:
>
> * not force people that want pyarrow to download datafusion binaries with
> it
> * have independent versioning from pyarrow
>
> and "bracket" the pyarrow versions that we ensure compatibility with.
>
> Another alternative is to release with the same versioning as datafusion,
> like arrow c++ / pyarrow and spark / pyspark.
> The upside is that the versions are aligned. The downside is that we will
> be releasing a lot of majors for no reason: so far, all backward
> incompatible changes in datafusion were not backward incompatible in
> python-datafusion: it is easier to break backward compat. in a Rust library
> than it is in a Python wrapper to a Rust library.
>
> What are your thoughts, Micah?
>
> Best,
> Jorge
>
> On Sun, Apr 25, 2021 at 10:32 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Jorge,
>> I think this would certainly be a valuable contribution.  How were you
>> thinking of hosting (which repo)/publishing it (maintaining a separate
>> wheel)?  Also, did you have thoughts on integration testing with pyarrow?
>>
>> Cheers,
>> Micah
>>
>> On Sun, Apr 25, 2021 at 9:13 AM Jorge Cardoso Leitão <
>> jorgecarlei...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I filed a PR [1] to open up a discussion to incorporate python-datafusion
>> > [2] into the Apache Arrow project.
>> >
>> > Python-datafusion is a Python library [3] built on top of DataFusion that
>> > enables people to use DataFusion from Python. It leverages the C data
>> > interface for zero-copy exchange between DataFusion and pyarrow (a bunch
>> > of pointers is shared around).
>> >
>> > For example, it allows users to read a CSV from Rust, pass the arrays to a
>> > C++ kernel, continue the computation in Rust's kernels, and export to
>> > parquet using Rust (or C++ parquet, or whatever ^_^). It supports UDFs and
>> > UDAFs, in case someone wants to go crazy with Pyarrow, Pandas, numpy or
>> > tensorflow. =)
>> >
>> > Best,
>> > Jorge
>> >
>> > [1] https://github.com/apache/arrow-datafusion/pull/69
>> > [2] https://github.com/jorgecarleitao/datafusion-python
>> > [3] https://pypi.org/project/datafusion/
>> >
>>
>
