Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

Wes McKinney Tue, 21 May 2019 04:35:33 -0700

Comments are on now, sorry about that.

On Tue, May 21, 2019, 1:06 AM Micah Kornfield <emkornfi...@gmail.com> wrote:


> Hi Wes,
> It looks like comments are turned off on the doc. Is this intentional?
>
> Thanks,
> Micah
>
> On Mon, May 20, 2019 at 3:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi folks,
> >
> > I'm interested in starting to build a so-called "data frame" interface
> > as a moderately opinionated, higher-level usability layer for
> > interacting with Arrow-based chunked in-memory data. I've had numerous
> > discussions (mostly in person) over the last few years about this, and
> > it feels to me that if we don't build something like this in Apache
> > Arrow, we could end up with several third-party efforts without
> > much community discussion or collaboration, which would be sad.
> >
> > Another anti-pattern that is occurring is that users are loading data
> > into Arrow, converting to a library like pandas in order to do some
> > simple in-memory data manipulations, then converting back to Arrow.
> > This is not the intended long-term mode of operation.
> >
> > I wrote in significantly more detail (~7-8 pages) about the context
> > and motivation for this project:
> >
> > https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing
> >
> > Note that this would be a parallel effort to go alongside the
> > previously-discussed "Query Engine" project, and the two things are
> > intended to work together. Since we are creating computational
> > kernels, this would also provide some immediacy in being able to
> > invoke kernels easily on large in-memory datasets without having to
> > wait for a more full-fledged query engine system to be developed.
> >
> > The details with these kinds of projects can be bedeviling so my
> > approach would be to begin to lay down the core abstractions and basic
> > APIs and use the project to drive the agenda for kernel development
> > (which can also be used in the context of a query engine runtime).
> > From my past experience designing pandas and some other in-memory
> > analytics projects, I have some idea of the kinds of mistakes or
> > design patterns I would like to _avoid_ in this effort, but others may
> > have some experiences they can offer to inform the design approach as
> > well.
> >
> > Looking forward to comments and discussion.
> >
> > - Wes
> >
>
