hi Antoine,

On Tue, May 21, 2019 at 5:48 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Hi Wes,
>
> How does copy-on-write play together with memory-mapped data? It seems
> that, depending on whether the memory map has several concurrent users
> (a condition which may be timing-dependent), we will either persist
> changes on disk or make them ephemeral in memory. That doesn't sound
> very user-friendly, IMHO.
With memory-mapping, any Buffer is sliced from the parent MemoryMap [1],
so mutating the data on disk through this interface wouldn't be possible
the way I've framed it.

Note that memory-mapping at all is already significantly more advanced
than what most people in the world are using every day. You won't find
examples of memory-mapping with pandas in my book, for example, because
it isn't possible. So if you memory-map, perform some analytics on the
mapped data (causing results to be materialized in memory), then write
out the results to a new file (or set of files), that would be an
innovation for most users.

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L353

>
> Regards
>
> Antoine.
>
>
> On 21/05/2019 at 00:39, Wes McKinney wrote:
> > hi folks,
> >
> > I'm interested in starting to build a so-called "data frame" interface
> > as a moderately opinionated, higher-level usability layer for
> > interacting with Arrow-based chunked in-memory data. I've had numerous
> > discussions (mostly in-person) over the last few years about this and
> > it feels to me that if we don't build something like this in Apache
> > Arrow, we could end up with several third-party efforts without much
> > community discussion or collaboration, which would be sad.
> >
> > Another anti-pattern that is occurring is that users are loading data
> > into Arrow, converting to a library like pandas in order to do some
> > simple in-memory data manipulations, then converting back to Arrow.
> > This is not the intended long-term mode of operation.
> >
> > I wrote in significantly more detail (~7-8 pages) about the context
> > and motivation for this project:
> >
> > https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing
> >
> > Note that this would be a parallel effort to go alongside the
> > previously-discussed "Query Engine" project, and the two things are
> > intended to work together.
> > Since we are creating computational kernels, this would also provide
> > some immediacy in being able to invoke kernels easily on large
> > in-memory datasets without having to wait for a more full-fledged
> > query engine system to be developed.
> >
> > The details with these kinds of projects can be bedeviling, so my
> > approach would be to begin to lay down the core abstractions and basic
> > APIs and use the project to drive the agenda for kernel development
> > (which can also be used in the context of a query engine runtime).
> > From my past experience designing pandas and some other in-memory
> > analytics projects, I have some idea of the kinds of mistakes or
> > design patterns I would like to _avoid_ in this effort, but others may
> > have experiences they can offer to inform the design approach as
> > well.
> >
> > Looking forward to comments and discussion.
> >
> > - Wes
> >
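As an aside, the read-only memory-mapping behavior described above (every
slice is a view of the parent map, analytics materialize new values in
memory, and results go to a new file rather than mutating the mapped
bytes) can be sketched with nothing but the Python standard library. This
is not Arrow code, just an analogy using `mmap`; the file names and the
little-endian int32 layout are made up for illustration:

```python
import mmap
import os
import struct
import tempfile

# Write a small binary file of four little-endian int32 values to map.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "data.bin")
with open(src, "wb") as f:
    f.write(struct.pack("<4i", 1, 2, 3, 4))

with open(src, "rb") as f:
    # Map the file read-only: every slice is a zero-copy view of the
    # map, analogous to a Buffer sliced from the parent MemoryMap.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)

    # "Analytics" materialize new values in regular memory; the mapped
    # bytes on disk are never touched (writing through the read-only
    # view would raise TypeError).
    values = struct.unpack("<4i", view[:16])
    total = sum(values)

    # Persist the result to a *new* file instead of mutating the map.
    out = os.path.join(tmpdir, "result.bin")
    with open(out, "wb") as g:
        g.write(struct.pack("<i", total))

    view.release()
    mm.close()
```

In Arrow the equivalent flow would be mapping an IPC file, running
kernels over the mapped arrays, and writing the materialized results out
as a new file, so the question of persisting copy-on-write changes back
through the map doesn't arise.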