Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

Antoine Pitrou Tue, 21 May 2019 03:48:57 -0700


Hi Wes,


How does copy-on-write play together with memory-mapped data?  It seems
that, depending on whether the memory map has several concurrent users
(a condition which may be timing-dependent), we will either persist
changes on disk or make them ephemeral in memory.  That doesn't sound
very user-friendly, IMHO.
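
To make the concern concrete, here is a minimal sketch using Python's
standard `mmap` module (not Arrow's API). The function
`cow_set_first_byte` and its `shared` flag are hypothetical: they assume a
copy-on-write scheme in which a sole owner mutates the mapped buffer in
place, while a buffer with several users is copied before being changed.

```python
import mmap
import os
import tempfile


def cow_set_first_byte(path, value, shared):
    """Hypothetical copy-on-write mutation of a memory-mapped file.

    Returns (bytes seen by the mutating user, bytes on disk afterwards).
    """
    # Assumption: a sole owner writes through to the file (ACCESS_WRITE);
    # a shared buffer gets a private copy of the touched pages (ACCESS_COPY).
    access = mmap.ACCESS_COPY if shared else mmap.ACCESS_WRITE
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=access) as m:
            m[0:1] = value
            if not shared:
                m.flush()  # push the in-place change to the file
            in_memory = m[:]
    with open(path, "rb") as f:
        on_disk = f.read()
    return in_memory, on_disk


def demo(shared):
    """Create a scratch file containing b"hello" and mutate its first byte."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"hello")
        return cow_set_first_byte(path, b"X", shared)
    finally:
        os.remove(path)
```

Under these assumptions, `demo(shared=False)` changes the file on disk,
whereas `demo(shared=True)` leaves the file untouched even though the
caller sees the same modified bytes in memory.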

Regards

Antoine.


On 21/05/2019 at 00:39, Wes McKinney wrote:
> hi folks,
> 
> I'm interested in starting to build a so-called "data frame" interface
> as a moderately opinionated, higher-level usability layer for
> interacting with Arrow-based chunked in-memory data. I've had numerous
> discussions (mostly in-person) over the last few years about this and
> it feels to me that if we don't build something like this in Apache
> Arrow, we could end up with several third-party efforts without
> much community discussion or collaboration, which would be sad.
> 
> Another anti-pattern that is occurring is that users are loading data
> into Arrow, converting to a library like pandas in order to do some
> simple in-memory data manipulations, then converting back to Arrow.
> This is not the intended long-term mode of operation.
> 
> I wrote in significantly more detail (~7-8 pages) about the context
> and motivation for this project:
> 
> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing
> 
> Note that this would be a parallel effort to go alongside the
> previously-discussed "Query Engine" project, and the two things are
> intended to work together. Since we are creating computational
> kernels, this would also provide some immediacy in being able to
> invoke kernels easily on large in-memory datasets without having to
> wait for a more full-fledged query engine system to be developed.
> 
> The details with these kinds of projects can be bedeviling so my
> approach would be to begin to lay down the core abstractions and basic
> APIs and use the project to drive the agenda for kernel development
> (which can also be used in the context of a query engine runtime).
> From my past experience designing pandas and some other in-memory
> analytics projects, I have some idea of the kinds of mistakes or
> design patterns I would like to _avoid_ in this effort, but others may
> have some experiences they can offer to inform the design approach as
> well.
> 
> Looking forward to comments and discussion.
> 
> - Wes
>
