hi folks,

I'm interested in starting to build a so-called "data frame" interface
as a moderately opinionated, higher-level usability layer for
interacting with Arrow-based chunked in-memory data. I've had numerous
discussions (mostly in person) over the last few years about this, and
it feels to me that if we don't build something like this in Apache
Arrow, we could end up with several third-party efforts without much
community discussion or collaboration, which would be sad.

Another anti-pattern we are seeing is that users load data into Arrow,
convert to a library like pandas in order to do some simple in-memory
data manipulations, then convert back to Arrow. This is not the
intended long-term mode of operation.

I wrote in significantly more detail (~7-8 pages) about the context
and motivation for this project:

https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing

Note that this would be a parallel effort to go alongside the
previously-discussed "Query Engine" project, and the two things are
intended to work together. Since we are creating computational
kernels, this would also provide some immediacy: kernels could be
invoked easily on large in-memory datasets without having to wait for
a more full-fledged query engine system to be developed.

The details of these kinds of projects can be bedeviling, so my
approach would be to begin to lay down the core abstractions and basic
APIs and use the project to drive the agenda for kernel development
(which can also be used in the context of a query engine runtime).
From my past experience designing pandas and some other in-memory
analytics projects, I have some idea of the kinds of mistakes or
design patterns I would like to _avoid_ in this effort, but others may
have some experiences they can offer to inform the design approach as
well.

Looking forward to comments and discussion.

- Wes
