Hello Matthew,

The dplyr binding is just syntactic sugar on top of the dataset API. There are no analytic capabilities yet [1], other than the select and the limited projection supported by the dataset API. It looks like it is doing analytics because of well-placed `collect()` calls, which convert Arrow's stream of RecordBatch into R's internal data frames. The analytic work is done by R. The same functionality exists in Python: you invoke the dataset scan and then pass the result to pandas.
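To make that division of labour concrete, here is a minimal, library-free sketch. It does not use the Arrow API: `Batch`, `scan`, `collect` and `mean_by` are hypothetical names, plain dicts stand in for RecordBatch, and the data is made up. The scan only streams batches and selects columns; the aggregation happens afterwards, in the host language, on the materialized result.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

# Hypothetical stand-in for an Arrow RecordBatch: column name -> list of values.
Batch = Dict[str, list]


def scan(batches: Iterable[Batch], columns: List[str]) -> Iterator[Batch]:
    """Stand-in for the dataset scan: stream batches, keeping only `columns`."""
    for batch in batches:
        yield {name: batch[name] for name in columns}


def collect(stream: Iterable[Batch]) -> Batch:
    """Stand-in for collect(): materialize the whole stream as one in-memory table."""
    table: Batch = defaultdict(list)
    for batch in stream:
        for name, values in batch.items():
            table[name].extend(values)
    return dict(table)


def mean_by(table: Batch, key: str, value: str) -> Dict[str, float]:
    """The 'analytics' step: a plain group-by mean done by the host language."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for k, v in zip(table[key], table[value]):
        groups[k].append(v)
    return {k: sum(vs) / len(vs) for k, vs in groups.items()}


# Two made-up batches, as if they came from two files.
source = [
    {"city": ["Paris", "Lyon"], "temp": [20.0, 24.0], "note": ["a", "b"]},
    {"city": ["Paris", "Lyon"], "temp": [22.0, 26.0], "note": ["c", "d"]},
]

table = collect(scan(source, columns=["city", "temp"]))  # scan side
result = mean_by(table, "city", "temp")                  # host-language side
print(result)
```

In the real bindings the role of `mean_by` is played by R data frames or pandas, which is why the result looks like analytics on Arrow even though the scan itself only selects and projects.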

In 2020 [2], we are actively working toward an analytic engine, with bindings for R *and* Python. Within this engine, we have physical operators, or compute kernels, that can be seen as functions that take a stream of RecordBatch and yield a new stream of RecordBatch. The dataset API is the Scan physical operator, i.e. it materializes a stream of RecordBatch from files or other sources. Gandiva is a compiler that generates the Filter and Project physical operators. Think of Gandiva as a physical operator factory: you give it a predicate (or multiple expressions in the case of projection) and it gives you back a function pointer that knows how to evaluate this predicate (or these expressions) on a RecordBatch and yields a RecordBatch. There still needs to be a coordinator on top of both that "plugs" them together, i.e. the execution engine.

Hope this helps,
François

[1] https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/r/R/dplyr.R#L255-L322
[2] https://ursalabs.org/blog/2020-outlook/

On Sun, Feb 9, 2020 at 11:24 PM Matthew Turner <matthew.m.tur...@outlook.com> wrote:
>
> Hi Wes / Arrow Dev Team,
>
> Following up on our brief twitter
> convo<https://twitter.com/wesmckinn/status/1222647039252525057> on the
> Datasets functionality in R / Python.
>
> To provide context to others, you had mentioned that the API in python /
> pyarrow was more developer centric and intended for users to consume it
> through higher level interfaces(i.e. IBIS). This was in comparison to dplyr
> which from your demo had some nice analytic capabilities on top of Arrow
> Datasets.
>
> Seeing that demonstration made me interested to see similar Arrow Datasets
> functionality within Python. But it doesn't seem that is an intended
> capability for pyarrow which I do generally understand. However, I was
> trying to understand how Gandiva ties into the Arrow project as I understand
> that to be an analytic engine of sorts (maybe im misunderstanding).
> I saw this<http://blog.christianperone.com/tag/python/> implementation of Gandiva
> with pandas which was quite interesting and was wondering if this is the
> strategic goal - to have Gandiva be a component of third party tools who use
> arrow or if Gandiva would eventually be more of a core analytic component of
> Arrow.
>
> Extending on this I hoping to get the teams view on what they see as the
> likely home of dplyr datasets type functionality within the python ecosystem
> (i.e. IBIS or something else).
>
> Thanks to all for your work on the project!
>
> Best,
>
> Matthew M. Turner
> Email: matthew.m.tur...@outlook.com<mailto:matthew.m.tur...@outlook.com>
> Phone: (908)-868-2786