Re: [DISCUSS] [Rust] Python-datafusion

Andy Grove Tue, 04 May 2021 07:09:18 -0700

I apologize. For some reason, I had thought that, because Jorge was the only
contributor (except for one contribution fixing a typo in the README), the
IP clearance process did not apply in this case.


I will create a PR to revert.

On Tue, May 4, 2021 at 8:06 AM Wes McKinney <wesmck...@gmail.com> wrote:

> Just to circle back on this. Since this was an independent codebase
> previously developed over a 10 month period, I had assumed we would be
> looking at an IP clearance vote, but instead it was just merged into
> arrow-datafusion.
>
> On Tue, Apr 27, 2021 at 10:50 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > Hi Jorge,
> > This all sounds good to me.  It might be nice to test against both the
> > pinned released version of pyarrow and at head if possible.
> >
> > I like the idea of not causing release churn as long as all the
> > underlying libraries are compatible.
> >
> > Thanks for the write up.
> >
> > -Micah
> >
> > On Mon, Apr 26, 2021 at 10:30 AM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> > > Hi Micah,
> > >
> > > All testing is actually done from Python: create a record batch in
> > > pyarrow, push it to datafusion, consume it back in Python, and
> > > compare the result using pyarrow's equality. Sometimes parquet is
> > > used instead.
> > > The library is tested against pyarrow==1 from PyPI: we can bump that,
> > > but if it works in pyarrow==1, chances are things will improve with
> > > higher versions :)
> > >
> > > Releases: I thought to have it released as a separate wheel for two
> > > reasons:
> > >
> > > * not force people that want pyarrow to download datafusion binaries
> > >   with it
> > > * have independent versioning from pyarrow
> > >
> > > and "bracketed" the pyarrow versions that we ensure compatibility with.
> > >
> > > Another alternative is to release with the same versioning as
> > > datafusion, like arrow c++ / pyarrow and spark / pyspark.
> > > The upside is that the versions are aligned. The downside is that we
> > > will be releasing a lot of majors for no reason: so far, none of the
> > > backward-incompatible changes in datafusion have been backward
> > > incompatible in python-datafusion, because it is easier to break
> > > backward compatibility in a Rust library than in a Python wrapper
> > > around a Rust library.
> > >
> > > What are your thoughts, Micah?
> > >
> > > Best,
> > > Jorge
> > >
> > >
> > >
> > >
> > >
> > > On Sun, Apr 25, 2021 at 10:32 PM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > >> Hi Jorge,
> > >> I think this would certainly be a valuable contribution.  How were you
> > >> thinking of hosting (which repo)/publishing it (maintaining a
> > >> separate wheel)?  Also, did you have thoughts on integration testing
> > >> with pyarrow?
> > >>
> > >> Cheers,
> > >> Micah
> > >>
> > >> On Sun, Apr 25, 2021 at 9:13 AM Jorge Cardoso Leitão <
> > >> jorgecarlei...@gmail.com> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > I filed a PR [1] to open up a discussion to incorporate
> > >> > python-datafusion [2] into the Apache Arrow project.
> > >> >
> > >> > Python-datafusion is a Python library [3] built on top of
> > >> > DataFusion that enables people to use DataFusion from Python. It
> > >> > leverages the C data interface for zero-copy sharing between
> > >> > DataFusion and pyarrow (only a bunch of pointers are shared around).
> > >> >
> > >> > For example, it allows users to read a CSV from Rust, pass the
> > >> > arrays to a C++ kernel, continue the computation in Rust's kernels,
> > >> > and export to parquet using Rust (or C++ parquet, or whatever ^_^).
> > >> > It supports UDFs and UDAFs, in case someone wants to go crazy with
> > >> > pyarrow, pandas, numpy or tensorflow. =)
> > >> >
> > >> > Best,
> > >> > Jorge
> > >> >
> > >> > [1] https://github.com/apache/arrow-datafusion/pull/69
> > >> > [2] https://github.com/jorgecarleitao/datafusion-python
> > >> > [3] https://pypi.org/project/datafusion/
> > >> >
> > >>
> > >
>
