hi folks, I'm interested in starting to build a so-called "data frame" interface as a moderately opinionated, higher-level usability layer for interacting with Arrow-based chunked in-memory data. I've had numerous discussions (mostly in-person) over the last few years about this, and it feels to me that if we don't build something like this in Apache Arrow, we could end up with several third-party efforts without much community discussion or collaboration, which would be sad.
Another anti-pattern that is occurring is that users are loading data into Arrow, converting to a library like pandas in order to do some simple in-memory data manipulations, then converting back to Arrow. This is not the intended long-term mode of operation.

I wrote in significantly more detail (~7-8 pages) about the context and motivation for this project: https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing

Note that this would be a parallel effort to go alongside the previously-discussed "Query Engine" project, and the two things are intended to work together. Since we are creating computational kernels, this would also provide some immediacy in being able to invoke kernels easily on large in-memory datasets without having to wait for a more full-fledged query engine system to be developed.

The details with these kinds of projects can be bedeviling, so my approach would be to begin to lay down the core abstractions and basic APIs and use the project to drive the agenda for kernel development (which can also be used in the context of a query engine runtime).

From my past experience designing pandas and some other in-memory analytics projects, I have some idea of the kinds of mistakes or design patterns I would like to _avoid_ in this effort, but others may have some experiences they can offer to inform the design approach as well.

Looking forward to comments and discussion.

- Wes