Re: [DISCUSS] [Rust] Python-datafusion

Wes McKinney Tue, 04 May 2021 07:36:08 -0700

I admit it's an unusual situation to have a single-author codebase
where the developer is on the PMC. Let's determine what the protocol
is for this kind of thing in the future so we don't create
unnecessary work for ourselves.


On Tue, May 4, 2021 at 9:15 AM Andy Grove <[email protected]> wrote:
>
> I apologize. For some reason, I had thought that, because Jorge was the only
> contributor (except for one contribution fixing a typo in the README), the
> IP clearance process did not apply in this case.
>
> I will create a PR to revert.
>
> On Tue, May 4, 2021 at 8:06 AM Wes McKinney <[email protected]> wrote:
>
> > Just to circle back on this. Since this was an independent codebase
> > previously developed over a 10-month period, I had assumed we would be
> > looking at an IP clearance vote, but instead it was just merged into
> > arrow-datafusion.
> >
> > On Tue, Apr 27, 2021 at 10:50 AM Micah Kornfield <[email protected]>
> > wrote:
> > >
> > > Hi Jorge,
> > > This all sounds good to me.  It might be nice to test against both the
> > > pinned released version of pyarrow and at head if possible.
> > >
> > > I like the idea of not causing release churn as long as all the
> > underlying
> > > libraries are compatible.
> > >
> > > Thanks for the write up.
> > >
> > > -Micah
> > >
> > > On Mon, Apr 26, 2021 at 10:30 AM Jorge Cardoso Leitão <
> > > [email protected]> wrote:
> > >
> > > > Hi Micah,
> > > >
> > > > All testing is actually done from Python: create a record batch in
> > > > pyarrow, push it to datafusion,
> > > > > consume it back in Python, and compare the result using pyarrow's
> > > > equality. Sometimes parquet is used instead.
> > > > The library is tested against pyarrow==1 from pypi: we can bump that,
> > but
> > > > if it works in pyarrow==1,
> > > > chances are things will improve with higher versions :)
> > > >
> > > > Releases: I thought to have it released as a separate wheel for two
> > > > reasons:
> > > >
> > > > * not force people that want pyarrow to download datafusion binaries
> > with
> > > > it
> > > > * have independent versioning from pyarrow
> > > >
> > > > > and "bracket" the pyarrow versions that we ensure compatibility with.
> > > >
> > > > Another alternative is to release with the same versioning as
> > datafusion,
> > > > like arrow c++ / pyarrow and spark / pyspark.
> > > > The upside is that the versions are aligned. The downside is that we
> > will
> > > > be releasing a lot of majors for no reason: so far, all backward
> > > > incompatible changes in datafusion were not backward incompatible in
> > > > python-datafusion: it is easier to break backward compat. in a Rust
> > library
> > > > than it is in a Python wrapper to a Rust library.
> > > >
> > > > What are your thoughts, Micah?
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Sun, Apr 25, 2021 at 10:32 PM Micah Kornfield <
> > [email protected]>
> > > > wrote:
> > > >
> > > >> Hi Jorge,
> > > >> I think this would certainly be a valuable contribution.  How were you
> > > > >> thinking of hosting (which repo)/publishing it (maintaining a
> > separate
> > > > >> wheel)?  Also, did you have thoughts on integration testing with pyarrow?
> > > >>
> > > >> Cheers,
> > > >> Micah
> > > >>
> > > >> On Sun, Apr 25, 2021 at 9:13 AM Jorge Cardoso Leitão <
> > > >> [email protected]> wrote:
> > > >>
> > > >> > Hi,
> > > >> >
> > > > >> > I filed a PR [1] to open up a discussion to incorporate
> > > >> python-datafusion
> > > >> > [2] into the Apache Arrow project.
> > > >> >
> > > > >> > Python-datafusion is a Python library [3] built on top of
> > DataFusion
> > > > >> that
> > > >> > enables people to use DataFusion from Python. It leverages the C
> > data
> > > >> > interface for zero-cost copy between DataFusion and pyarrow (a
> > bunch of
> > > >> > pointers is shared around).
> > > >> >
> > > >> > For example, it allows users to read a CSV from Rust, pass the
> > arrays
> > > >> to a
> > > >> > C++ kernel, continue the computation in Rust's kernels, and export
> > to
> > > >> > parquet using Rust (or C++ parquet, or whatever ^_^). It supports
> > UDFs
> > > >> and
> > > >> > UDAFs, in case someone wants to go crazy with Pyarrow, Pandas,
> > numpy or
> > > >> > tensorflow. =)
> > > >> >
> > > >> > Best,
> > > >> > Jorge
> > > >> >
> > > >> > [1] https://github.com/apache/arrow-datafusion/pull/69
> > > >> > [2] https://github.com/jorgecarleitao/datafusion-python
> > > >> > [3] https://pypi.org/project/datafusion/
> > > >> >
> > > >>
> > > >
> >
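
The idea raised earlier in the thread, of releasing a separate wheel that
declares a range of pyarrow versions it is known to be compatible with, can
be illustrated with a small standard-library sketch. This is only an
illustration under stated assumptions: the helper names `parse_version` and
`in_bracket` and the bounds 1.0.0 and 4.0.0 are hypothetical and do not come
from python-datafusion or from this thread. A real package would more likely
declare the range in its packaging metadata than check it at runtime.

```python
# Hypothetical sketch: check whether a pyarrow version string falls inside a
# supported range [lower, upper). Names and bounds are illustrative only.

def parse_version(text):
    """Turn '1.0.1' into (1, 0, 1), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    # Pad to three components so that '1.0' and '1.0.0' compare as equal.
    return tuple((parts + [0, 0, 0])[:3])


def in_bracket(version, lower, upper):
    """Return True if lower <= version < upper."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


if __name__ == "__main__":
    # Assumed bounds, chosen purely for illustration.
    for candidate in ("0.17.1", "1.0.0", "3.0.0", "4.0.0"):
        print(candidate, in_bracket(candidate, "1.0.0", "4.0.0"))
```

With a check of this shape, the wrapper can keep its own version numbers and
only needs a new release when the supported pyarrow range actually changes.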
