hi Antoine,

On Tue, May 21, 2019 at 5:48 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Hi Wes,
>
> How does copy-on-write play together with memory-mapped data? It seems
> that, depending on whether the memory map has several concurrent users
> (a condition which may be timing-dependent), we will either persist
> changes on disk or make them ephemeral in memory. That doesn't sound
> very user-friendly, IMHO.
With memory-mapping, any Buffer is sliced from the parent MemoryMap [1],
so mutating the data on disk through this interface wouldn't be possible
the way I've framed it.

Note that memory-mapping at all is already significantly more advanced
than what most people in the world are using every day. You won't find
examples of memory-mapping with pandas in my book, for example, because
it isn't possible. So if you memory-map, perform some analytics on the
mapped data (causing results to be materialized in memory), then write
out the results to a new file (or set of files), that would be an
innovation for most users.

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L353

>
> Regards
>
> Antoine.
>
>
> On 21/05/2019 at 00:39, Wes McKinney wrote:
> > hi folks,
> >
> > I'm interested in starting to build a so-called "data frame" interface
> > as a moderately opinionated, higher-level usability layer for
> > interacting with Arrow-based chunked in-memory data. I've had numerous
> > discussions (mostly in-person) over the last few years about this and
> > it feels to me that if we don't build something like this in Apache
> > Arrow, we could end up with several third-party efforts without much
> > community discussion or collaboration, which would be sad.
> >
> > Another anti-pattern that is occurring is that users are loading data
> > into Arrow, converting to a library like pandas in order to do some
> > simple in-memory data manipulations, then converting back to Arrow.
> > This is not the intended long-term mode of operation.
> >
> > I wrote in significantly more detail (~7-8 pages) about the context
> > and motivation for this project:
> >
> > https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing
> >
> > Note that this would be a parallel effort to go alongside the
> > previously-discussed "Query Engine" project, and the two things are
> > intended to work together.
> > Since we are creating computational kernels, this would also provide
> > some immediacy in being able to invoke kernels easily on large
> > in-memory datasets without having to wait for a more full-fledged
> > query engine system to be developed.
> >
> > The details with these kinds of projects can be bedeviling, so my
> > approach would be to begin to lay down the core abstractions and basic
> > APIs and use the project to drive the agenda for kernel development
> > (which can also be used in the context of a query engine runtime).
> > From my past experience designing pandas and some other in-memory
> > analytics projects, I have some idea of the kinds of mistakes or
> > design patterns I would like to _avoid_ in this effort, but others may
> > have experiences they can offer to inform the design approach as
> > well.
> >
> > Looking forward to comments and discussion.
> >
> > - Wes
> >
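As an aside, the read-only memory-mapping behavior described above (every
slice is a view of the parent map, analytics materialize new values in
memory, and results go to a new file rather than mutating the mapped
bytes) can be sketched with nothing but the Python standard library. This
is not Arrow code, just an analogy using `mmap`; the file names and the
little-endian int32 layout are made up for illustration:

```python
import mmap
import os
import struct
import tempfile

# Write a small binary file of four little-endian int32 values to map.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "data.bin")
with open(src, "wb") as f:
    f.write(struct.pack("<4i", 1, 2, 3, 4))

with open(src, "rb") as f:
    # Map the file read-only: every slice is a zero-copy view of the
    # map, analogous to a Buffer sliced from the parent MemoryMap.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)

    # "Analytics" materialize new values in regular memory; the mapped
    # bytes on disk are never touched (writing through the read-only
    # view would raise TypeError).
    values = struct.unpack("<4i", view[:16])
    total = sum(values)

    # Persist the result to a *new* file instead of mutating the map.
    out = os.path.join(tmpdir, "result.bin")
    with open(out, "wb") as g:
        g.write(struct.pack("<i", total))

    view.release()
    mm.close()
```

In Arrow the equivalent flow would be mapping an IPC file, running
kernels over the mapped arrays, and writing the materialized results out
as a new file, so the question of persisting copy-on-write changes back
through the map doesn't arise.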