Hi Matthew,

Thanks -- our contribution workflow is roughly to define JIRA tickets
and then submit pull requests. Adding documentation or examples is
also helpful.
When there is uncertainty about the scope of a ticket or the solution
approach, feel free to ask questions and we will try to provide
feedback as we're able.

- Wes

On Mon, Feb 17, 2020 at 9:06 PM Matthew Turner
<matthew.m.tur...@outlook.com> wrote:
>
> Hi Francois,
>
> Thanks for the response - the explanation definitely helped and I
> will review the provided documents.
>
> Hi Wes,
>
> I am interested in helping, but I have two constraints:
>
> - With my current schedule I won't have free time for another 2-3
>   months.
> - My skillset is more on the end-user / business side. My main job
>   is on a trading desk, where I am driving our efforts to build out
>   more analytic capabilities for the desk (leaning heavily on
>   parquet/pyarrow/pandas). To the extent you think I could still add
>   value, I'm happy to discuss further.
>
> Either way, thanks all for the work, and I look forward to all the
> developments this year.
>
> Best,
>
> Matthew M. Turner
> Email: matthew.m.tur...@outlook.com
> Phone: (908)-868-2786
>
> -----Original Message-----
> From: Wes McKinney <wesmck...@gmail.com>
> Sent: Monday, February 10, 2020 10:33 AM
> To: dev <dev@arrow.apache.org>
> Subject: Re: Arrow Datasets Functionality for Python
>
> I will add that I'm interested in being involved with developing
> high-level Python interfaces to all of this functionality (e.g.
> using Ibis [1]). It would be worth prototyping at least a datasets
> interface layer for efficient data selection (predicate pushdown +
> filtering) and then expanding to support more analytic operations as
> they are implemented and available in pyarrow. There's just a lot of
> work to do and at the moment not a lot of people to do it. Hopefully
> more organizations will sponsor part- or full-time developers to get
> involved in Apache Arrow development and help with maintenance and
> feature development -- this is a challenging project to contribute
> to on nights/weekends.
>
> [1]: https://github.com/ibis-project/ibis
>
> On Mon, Feb 10, 2020 at 8:34 AM Francois Saint-Jacques
> <fsaintjacq...@gmail.com> wrote:
> >
> > Hello Matthew,
> >
> > The dplyr binding is just syntactic sugar on top of the dataset
> > API. There are no analytics capabilities yet [1], other than the
> > select and the limited projection supported by the dataset API. It
> > looks like it is doing analytics due to properly placed
> > `collect()` calls, which convert Arrow's stream of RecordBatches
> > into R internal data frames; the analytic work is done by R. The
> > same functionality exists in Python: you invoke the dataset scan
> > and then pass the result to pandas.
> >
> > In 2020 [2], we are actively working toward an analytic engine,
> > with bindings for R *and* Python. Within this engine, we have
> > physical operators, or compute kernels, which can be seen as
> > functions that take a stream of RecordBatches and yield a new
> > stream of RecordBatches. The dataset API is the Scan physical
> > operator, i.e. it materializes a stream of RecordBatches from
> > files or other sources. Gandiva is a compiler that generates the
> > Filter and Project physical operators. Think of Gandiva as a
> > physical-operator factory: you give it a predicate (or multiple
> > expressions, in the case of projection) and it gives you back a
> > function pointer that knows how to evaluate that predicate (or
> > those expressions) on a RecordBatch and yield a new RecordBatch.
> > There still needs to be a coordinator on top of both that "plugs"
> > them together, i.e. the execution engine.
> >
> > Hope this helps,
> > François
> >
> > [1] https://github.com/apache/arrow/blob/6600a39ffe149971afd5ad3c78c2b538cdc03cfd/r/R/dplyr.R#L255-L322
> > [2] https://ursalabs.org/blog/2020-outlook/
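For concreteness, the scan-then-collect pattern François describes
looks roughly like this in Python -- a minimal sketch, assuming a
recent pyarrow with the dataset module; the path, column names, and
predicate are invented for illustration:

    # Minimal sketch, assuming a recent pyarrow with the dataset
    # module; the path, columns, and predicate are invented.
    import pyarrow.dataset as ds

    dataset = ds.dataset("/data/trades", format="parquet")

    # Projection and predicate pushdown happen inside the scan:
    # only the selected columns are read, and the filter can prune
    # files and row groups before anything is materialized.
    table = dataset.to_table(
        columns=["symbol", "price"],
        filter=ds.field("price") > 100.0,
    )

    # The equivalent of dplyr's collect(): hand the result to pandas.
    df = table.to_pandas()

As in the dplyr demo, everything before to_pandas() is selection and
filtering handled by Arrow; the actual analytics run in pandas.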
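And the "operator factory" view of Gandiva, through the
pyarrow.gandiva bindings along the lines of the blog post Matthew
links below -- a sketch assuming a pyarrow build with Gandiva
enabled; the schema and values are toy data:

    # Sketch assuming a pyarrow build with Gandiva enabled; the
    # schema and values below are toy data for illustration.
    import pyarrow as pa
    import pyarrow.gandiva as gandiva

    batch = pa.RecordBatch.from_arrays(
        [pa.array([1.0, 25.0, 3.0]), pa.array([10.0, 2.0, 30.0])],
        names=["x", "y"],
    )

    # Build the predicate x > 10.0 as an expression tree ...
    builder = gandiva.TreeExprBuilder()
    node_x = builder.make_field(batch.schema.field("x"))
    ten = builder.make_literal(10.0, pa.float64())
    condition = builder.make_condition(
        builder.make_function("greater_than", [node_x, ten], pa.bool_())
    )

    # ... and let Gandiva JIT-compile it into a reusable Filter that
    # can be evaluated against any RecordBatch with this schema.
    filter_ = gandiva.make_filter(batch.schema, condition)
    selection = filter_.evaluate(batch, pa.default_memory_pool())
    print(selection.to_array())  # indices of rows where x > 10.0

The compiled filter returns a selection vector of matching row
indices rather than a copied batch, which is what lets the same
compiled predicate be applied cheaply across a whole stream of
RecordBatches.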
> > On Sun, Feb 9, 2020 at 11:24 PM Matthew Turner
> > <matthew.m.tur...@outlook.com> wrote:
> > >
> > > Hi Wes / Arrow Dev Team,
> > >
> > > Following up on our brief twitter convo
> > > <https://twitter.com/wesmckinn/status/1222647039252525057> on
> > > the Datasets functionality in R / Python.
> > >
> > > To provide context to others, you had mentioned that the API in
> > > python / pyarrow was more developer-centric and intended for
> > > users to consume through higher-level interfaces (i.e. Ibis).
> > > This was in comparison to dplyr, which from your demo had some
> > > nice analytic capabilities on top of Arrow Datasets.
> > >
> > > Seeing that demonstration made me interested to see similar
> > > Arrow Datasets functionality within Python, though I understand
> > > that isn't an intended capability for pyarrow itself. However, I
> > > was trying to understand how Gandiva ties into the Arrow
> > > project, as I understand it to be an analytic engine of sorts
> > > (maybe I'm misunderstanding). I saw this
> > > <http://blog.christianperone.com/tag/python/> implementation of
> > > Gandiva with pandas, which was quite interesting, and was
> > > wondering about the strategic goal: whether Gandiva is meant to
> > > be a component of third-party tools that use Arrow, or whether
> > > it would eventually be more of a core analytic component of
> > > Arrow itself.
> > >
> > > Extending on this, I was hoping to get the team's view on the
> > > likely home of dplyr-style datasets functionality within the
> > > Python ecosystem (i.e. Ibis or something else).
> > >
> > > Thanks to all for your work on the project!
> > >
> > > Best,
> > >
> > > Matthew M. Turner
> > > Email: matthew.m.tur...@outlook.com
> > > Phone: (908)-868-2786