Hi Matthew,

Thanks -- our contribution workflow is roughly to define JIRA tickets
and then submit pull requests. Adding documentation or examples is
also helpful.
When there is uncertainty about the scope of a ticket or the solution
approach, feel free to ask questions and we will try to provide
feedback as we're able.

- Wes

On Mon, Feb 17, 2020 at 9:06 PM Matthew Turner
<matthew.m.tur...@outlook.com> wrote:
>
> Hi Francois,
>
> Thanks for the response - the explanation definitely helped and I
> will review the provided documents.
>
> Hi Wes,
>
> I am interested in helping, but I have two constraints:
>
> - With my current schedule I won't have free time for another 2-3
>   months.
> - My skillset is more on the end-user / business side. My main job
>   is on a trading desk, where I am driving our efforts to build out
>   more analytic capabilities for the desk (leaning heavily on
>   parquet/pyarrow/pandas). To the extent you think I could still add
>   value, I'm happy to discuss further.
>
> Either way, thanks all for the work, and I look forward to all the
> developments this year.
>
> Best,
>
> Matthew M. Turner
> Email: matthew.m.tur...@outlook.com
> Phone: (908)-868-2786
>
> -----Original Message-----
> From: Wes McKinney <wesmck...@gmail.com>
> Sent: Monday, February 10, 2020 10:33 AM
> To: dev <dev@arrow.apache.org>
> Subject: Re: Arrow Datasets Functionality for Python
>
> I will add that I'm interested in being involved with developing
> high-level Python interfaces to all of this functionality (e.g.
> using Ibis [1]). It would be worth prototyping at least a datasets
> interface layer for efficient data selection (predicate pushdown +
> filtering) and then expanding to support more analytic operations as
> they are implemented and available in pyarrow. There's just a lot of
> work to do and at the moment not a lot of people to do it. Hopefully
> more organizations will sponsor part- or full-time developers to get
> involved in Apache Arrow development and help with maintenance and
> feature development -- this is a challenging project to contribute
> to on nights/weekends.
>
> [1]: https://github.com/ibis-project/ibis
>
> On Mon, Feb 10, 2020 at 8:34 AM Francois Saint-Jacques
> <fsaintjacq...@gmail.com> wrote:
> >
> > Hello Matthew,
> >
> > The dplyr binding is just syntactic sugar on top of the dataset
> > API. There are no analytics capabilities yet [1], other than the
> > select and the limited projection supported by the dataset API. It
> > looks like it is doing analytics due to properly placed
> > `collect()` calls, which convert Arrow's stream of RecordBatches
> > into R internal data frames; the analytic work is done by R. The
> > same functionality exists in Python: you invoke the dataset scan
> > and then pass the result to pandas.
> >
> > In 2020 [2], we are actively working toward an analytic engine,
> > with bindings for R *and* Python. Within this engine, we have
> > physical operators, or compute kernels, which can be seen as
> > functions that take a stream of RecordBatches and yield a new
> > stream of RecordBatches. The dataset API is the Scan physical
> > operator, i.e. it materializes a stream of RecordBatches from
> > files or other sources. Gandiva is a compiler that generates the
> > Filter and Project physical operators. Think of Gandiva as a
> > physical-operator factory: you give it a predicate (or multiple
> > expressions, in the case of projection) and it gives you back a
> > function pointer that knows how to evaluate that predicate (or
> > those expressions) on a RecordBatch and yield a new RecordBatch.
> > There still needs to be a coordinator on top of both that "plugs"
> > them together, i.e. the execution engine.
> >
> > Hope this helps,
> > François
> >
> > [1] https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/r/R/dplyr.R#L255-L322
> > [2] https://ursalabs.org/blog/2020-outlook/
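For concreteness, the scan-then-collect pattern François describes
looks roughly like this in Python -- a minimal sketch, assuming a
recent pyarrow with the dataset module; the path, column names, and
predicate are invented for illustration:

    # Minimal sketch, assuming a recent pyarrow with the dataset
    # module; the path, columns, and predicate are invented.
    import pyarrow.dataset as ds

    dataset = ds.dataset("/data/trades", format="parquet")

    # Projection and predicate pushdown happen inside the scan:
    # only the selected columns are read, and the filter can prune
    # files and row groups before anything is materialized.
    table = dataset.to_table(
        columns=["symbol", "price"],
        filter=ds.field("price") > 100.0,
    )

    # The equivalent of dplyr's collect(): hand the result to pandas.
    df = table.to_pandas()

As in the dplyr demo, everything before to_pandas() is selection and
filtering handled by Arrow; the actual analytics run in pandas.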
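And the "operator factory" view of Gandiva, through the
pyarrow.gandiva bindings along the lines of the blog post Matthew
links below -- a sketch assuming a pyarrow build with Gandiva
enabled; the schema and values are toy data:

    # Sketch assuming a pyarrow build with Gandiva enabled; the
    # schema and values below are toy data for illustration.
    import pyarrow as pa
    import pyarrow.gandiva as gandiva

    batch = pa.RecordBatch.from_arrays(
        [pa.array([1.0, 25.0, 3.0]), pa.array([10.0, 2.0, 30.0])],
        names=["x", "y"],
    )

    # Build the predicate x > 10.0 as an expression tree ...
    builder = gandiva.TreeExprBuilder()
    node_x = builder.make_field(batch.schema.field("x"))
    ten = builder.make_literal(10.0, pa.float64())
    condition = builder.make_condition(
        builder.make_function("greater_than", [node_x, ten], pa.bool_())
    )

    # ... and let Gandiva JIT-compile it into a reusable Filter that
    # can be evaluated against any RecordBatch with this schema.
    filter_ = gandiva.make_filter(batch.schema, condition)
    selection = filter_.evaluate(batch, pa.default_memory_pool())
    print(selection.to_array())  # indices of rows where x > 10.0

The compiled filter returns a selection vector of matching row
indices rather than a copied batch, which is what lets the same
compiled predicate be applied cheaply across a whole stream of
RecordBatches.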
> > On Sun, Feb 9, 2020 at 11:24 PM Matthew Turner
> > <matthew.m.tur...@outlook.com> wrote:
> > >
> > > Hi Wes / Arrow Dev Team,
> > >
> > > Following up on our brief twitter convo
> > > <https://twitter.com/wesmckinn/status/1222647039252525057> on
> > > the Datasets functionality in R / Python.
> > >
> > > To provide context to others, you had mentioned that the API in
> > > python / pyarrow was more developer-centric and intended for
> > > users to consume through higher-level interfaces (i.e. Ibis).
> > > This was in comparison to dplyr, which from your demo had some
> > > nice analytic capabilities on top of Arrow Datasets.
> > >
> > > Seeing that demonstration made me interested to see similar
> > > Arrow Datasets functionality within Python, though I understand
> > > that isn't an intended capability for pyarrow itself. However, I
> > > was trying to understand how Gandiva ties into the Arrow
> > > project, as I understand it to be an analytic engine of sorts
> > > (maybe I'm misunderstanding). I saw this
> > > <http://blog.christianperone.com/tag/python/> implementation of
> > > Gandiva with pandas, which was quite interesting, and was
> > > wondering about the strategic goal: whether Gandiva is meant to
> > > be a component of third-party tools that use Arrow, or whether
> > > it would eventually be more of a core analytic component of
> > > Arrow itself.
> > >
> > > Extending on this, I was hoping to get the team's view on the
> > > likely home of dplyr-style datasets functionality within the
> > > Python ecosystem (i.e. Ibis or something else).
> > >
> > > Thanks to all for your work on the project!
> > >
> > > Best,
> > >
> > > Matthew M. Turner
> > > Email: matthew.m.tur...@outlook.com
> > > Phone: (908)-868-2786