Re: Arrow Datasets Functionality for Python

Francois Saint-Jacques Mon, 10 Feb 2020 06:34:56 -0800

Hello Matthew,

The dplyr binding is just syntactic sugar on top of the dataset API.
There are no analytics capabilities yet [1], other than the select and
limited projection supported by the dataset API. It looks like it
is doing analytics because of well-placed `collect()` calls, which
convert Arrow's stream of RecordBatch into R's native data frames.
The analytic work is done by R. The same functionality exists in
Python: you invoke the dataset scan and then pass the result to
pandas.

In 2020 [2], we are actively working toward an analytic engine, with
bindings for R *and* Python. Within this engine, we have physical
operators, or compute kernels, that can be seen as functions that
take a stream of RecordBatch and yield a new stream of RecordBatch.
The dataset API is the Scan physical operator, i.e. it materializes a
stream of RecordBatch from files or other sources. Gandiva is a
compiler that generates the Filter and Project physical operators.
Think of Gandiva as a physical operator factory: you give it a
predicate (or multiple expressions in the case of projection) and it
gives you back a function pointer that knows how to evaluate this
predicate (or these expressions) on a RecordBatch and yield a
RecordBatch. There still needs to be a coordinator on top of both that
"plugs" them together, i.e. the execution engine.
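
To make the operator picture concrete, here is a toy sketch in plain
Python. It is not Arrow's or Gandiva's API: a batch is modeled as a dict
of column lists, and `scan`, `make_filter`, `make_project` and `execute`
are hypothetical names chosen for illustration. The factories play the
role described for Gandiva (expression in, operator out), and `execute`
stands in for the coordinator.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Stand-in for an Arrow RecordBatch: column name -> list of values.
Batch = Dict[str, list]
# A physical operator maps a stream of batches to a new stream of batches.
Operator = Callable[[Iterator[Batch]], Iterator[Batch]]


def _rows(batch: Batch) -> List[dict]:
    """View a columnar batch as a list of row dicts (toy helper)."""
    names = list(batch)
    return [dict(zip(names, values)) for values in zip(*batch.values())]


def scan(sources: Iterable[List[Batch]]) -> Iterator[Batch]:
    """Scan operator: materialize a stream of batches from the sources."""
    for source in sources:
        yield from source


def make_filter(predicate: Callable[[dict], bool]) -> Operator:
    """Operator factory: given a predicate, return a Filter operator."""
    def filter_op(batches: Iterator[Batch]) -> Iterator[Batch]:
        for batch in batches:
            kept = [row for row in _rows(batch) if predicate(row)]
            yield {name: [row[name] for row in kept] for name in batch}
    return filter_op


def make_project(expressions: Dict[str, Callable[[dict], object]]) -> Operator:
    """Operator factory: given named expressions, return a Project operator."""
    def project_op(batches: Iterator[Batch]) -> Iterator[Batch]:
        for batch in batches:
            rows = _rows(batch)
            yield {out: [expr(row) for row in rows]
                   for out, expr in expressions.items()}
    return project_op


def execute(source: Iterator[Batch], operators: List[Operator]) -> Iterator[Batch]:
    """Coordinator: plug the operators together, one feeding the next."""
    stream = source
    for op in operators:
        stream = op(stream)
    return stream


# Two hypothetical "files", each holding one batch.
files = [
    [{"x": [1, 2, 3], "y": [10, 20, 30]}],
    [{"x": [4, 5], "y": [40, 50]}],
]
plan = [
    make_filter(lambda row: row["x"] % 2 == 1),
    make_project({"total": lambda row: row["x"] + row["y"]}),
]
result = list(execute(scan(files), plan))
# result == [{"total": [11, 33]}, {"total": [55]}]
```

Because every operator has the same stream-in, stream-out shape, the
coordinator only has to chain them; nothing is computed until the final
stream is consumed.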

Hope this helps,
François

[1] https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/r/R/dplyr.R#L255-L322
[2] https://ursalabs.org/blog/2020-outlook/

On Sun, Feb 9, 2020 at 11:24 PM Matthew Turner
<matthew.m.tur...@outlook.com> wrote:
>
> Hi Wes / Arrow Dev Team,
>
> Following up on our brief twitter 
> convo<https://twitter.com/wesmckinn/status/1222647039252525057> on the 
> Datasets functionality in R / Python.
>
> To provide context to others, you had mentioned that the API in python / 
> pyarrow was more developer centric and intended for users to consume it 
> through higher level interfaces (i.e. IBIS).  This was in comparison to dplyr 
> which from your demo had some nice analytic capabilities on top of Arrow 
> Datasets.
>
> Seeing that demonstration made me interested to see similar Arrow Datasets 
> functionality within Python.  But it doesn't seem that is an intended 
> capability for pyarrow which I do generally understand.  However, I was 
> trying to understand how Gandiva ties into the Arrow project as I understand 
> that to be an analytic engine of sorts (maybe I'm misunderstanding).  I saw 
> this<http://blog.christianperone.com/tag/python/> implementation of Gandiva 
> with pandas which was quite interesting and was wondering if this is the 
> strategic goal - to have Gandiva be a component of third party tools who use 
> arrow or if Gandiva would eventually be more of a core analytic component of 
> Arrow.
>
> Extending on this, I am hoping to get the team's view on what they see as the 
> likely home of dplyr datasets type functionality within the python ecosystem 
> (i.e. IBIS or something else).
>
> Thanks to all for your work on the project!
>
> Best,
>
> Matthew M. Turner
> Email: matthew.m.tur...@outlook.com<mailto:matthew.m.tur...@outlook.com>
> Phone: (908)-868-2786
>
