Re: [DISCUSS] [Rust] Python-datafusion

Wes McKinney Tue, 04 May 2021 07:05:57 -0700

Just to circle back on this. Since this was an independent codebase
previously developed over a 10-month period, I had assumed we would be
looking at an IP clearance vote, but instead it was just merged into
arrow-datafusion.


On Tue, Apr 27, 2021 at 10:50 AM Micah Kornfield <[email protected]> wrote:
>
> Hi Jorge,
> This all sounds good to me.  It might be nice to test against both the
> pinned released version of pyarrow and at head if possible.
>
> I like the idea of not causing release churn as long as all the underlying
> libraries are compatible.
>
> Thanks for the write-up.
>
> -Micah
>
> On Mon, Apr 26, 2021 at 10:30 AM Jorge Cardoso Leitão <
> [email protected]> wrote:
>
> > Hi Micah,
> >
> > All testing is actually done from Python: create a record batch in
> > pyarrow, push it to datafusion, consume it back in Python, and compare
> > the result using pyarrow's equality. Sometimes parquet is used instead.
> > The library is tested against pyarrow==1 from PyPI: we can bump that, but
> > if it works in pyarrow==1, chances are things will improve with higher
> > versions :)
> >
> > Releases: I thought to have it released as a separate wheel for two
> > reasons:
> >
> > * not force people who want pyarrow to download datafusion binaries
> >   with it
> > * have independent versioning from pyarrow
> >
> > and to "bracket" the pyarrow versions that we ensure compatibility with.
> >
> > Another alternative is to release with the same versioning as datafusion,
> > like arrow c++ / pyarrow and spark / pyspark.
> > The upside is that the versions are aligned. The downside is that we will
> > be releasing a lot of majors for no reason: so far, none of the
> > backward-incompatible changes in datafusion has been backward
> > incompatible in python-datafusion. It is easier to break backward
> > compatibility in a Rust library than in a Python wrapper around one.
> >
> > What are your thoughts, Micah?
> >
> > Best,
> > Jorge
> >
> >
> >
> >
> >
> > On Sun, Apr 25, 2021 at 10:32 PM Micah Kornfield <[email protected]>
> > wrote:
> >
> >> Hi Jorge,
> >> I think this would certainly be a valuable contribution.  How were you
> >> thinking of hosting (which repo)/publishing it (maintaining a separate
> >> wheel)?  Also did you have thoughts on integration testing with pyarrow?
> >>
> >> Cheers,
> >> Micah
> >>
> >> On Sun, Apr 25, 2021 at 9:13 AM Jorge Cardoso Leitão <
> >> [email protected]> wrote:
> >>
> >> > Hi,
> >> >
> >> > I filed a PR [1] to open up a discussion on incorporating
> >> > python-datafusion [2] into the Apache Arrow project.
> >> >
> >> > Python-datafusion is a Python library [3] built on top of DataFusion
> >> > that enables people to use DataFusion from Python. It leverages the C
> >> > data interface for zero-copy data sharing between DataFusion and
> >> > pyarrow (only a bunch of pointers are shared around).
> >> >
> >> > For example, it allows users to read a CSV from Rust, pass the arrays
> >> > to a C++ kernel, continue the computation in Rust's kernels, and export
> >> > to parquet using Rust (or C++ parquet, or whatever ^_^). It supports
> >> > UDFs and UDAFs, in case someone wants to go crazy with pyarrow, Pandas,
> >> > NumPy or TensorFlow. =)
> >> >
> >> > Best,
> >> > Jorge
> >> >
> >> > [1] https://github.com/apache/arrow-datafusion/pull/69
> >> > [2] https://github.com/jorgecarleitao/datafusion-python
> >> > [3] https://pypi.org/project/datafusion/
> >> >
> >>
> >
