Here's the image that popped into my mind when I heard about this project. At best it's a motivating example; at worst it's a distraction:

1. Spark reads Parquet from wherever into an Arrow structure in shared
   memory.
2. The Spark executor calls into the Python half of pyspark with a handle
   to this memory.
3. Pandas computes some summary over this data in place and produces a
   smaller result set, also in memory, also in Arrow format.
4. The Spark driver collects the results (without ser/de overheads, just
   straight byte buffer reads across the network) and concatenates them
   in memory.
5. The Spark driver hands that memory to local pandas, which does more
   computation over it in place.

This is what got me excited: it kills a lot of the Spark overheads
related to data (and for a kicker, it also enables full pandas
compatibility, along with statsmodels/scikit/etc., in step 3). I suppose
if this is not a vision the Arrow devs have, I'd be interested to hear
how it differs. Not looking for any promises yet, though.

On Wed, Feb 24, 2016 at 14:52 Corey Nolet <cjno...@apache.org> wrote:

> So far, how does the integration with the Spark project look? Do you
> envision cached Spark partitions allocating Arrows? I could imagine this
> would be absolutely huge for being able to ask questions of real-time
> data sets across applications.
>
> On Wed, Feb 24, 2016 at 2:49 PM, Zhe Zhang <z...@apache.org> wrote:
>
> > Thanks for the insights Jacques. Interesting to learn the thoughts on
> > zero-copy sharing.
> >
> > mmap allows sharing address spaces via filesystem interface. That has
> > some security concerns but with the immutability restrictions (as you
> > clarified here), it sounds feasible. What other options do you have in
> > mind?
> >
> > On Wed, Feb 24, 2016 at 11:30 AM Corey Nolet <cjno...@gmail.com> wrote:
> >
> > > Agreed,
> > >
> > > I thought the whole purpose was to share the memory space (using
> > > possibly unsafe operations like ByteBuffers) so that it could be
> > > directly shared without copy. My interest in this is to have it
> > > enable fully in-memory computation. Not just "processing" as in
> > > Spark, but as a fully in-memory datastore that one application can
> > > expose and share with others (e.g. an in-memory structure is
> > > constructed from a series of parquet files, somehow, then Spark
> > > pulls it in, does some computations, exposes a data set, etc...).
> > >
> > > If you are leaving the allocation of the memory to the applications
> > > and underneath the memory is being allocated using direct
> > > bytebuffers, I can't see exactly why the problem is fundamentally
> > > hard, especially if the applications themselves are worried about
> > > exposing their own memory spaces.
> > >
> > > On Wed, Feb 24, 2016 at 2:17 PM, Andrew Brust <
> > > andrew.br...@bluebadgeinsights.com> wrote:
> > >
> > > > Hmm...that's not exactly how Jacques described things to me when
> > > > he briefed me on Arrow ahead of the announcement.
> > > >
> > > > -----Original Message-----
> > > > From: Zhe Zhang [mailto:z...@apache.org]
> > > > Sent: Wednesday, February 24, 2016 2:08 PM
> > > > To: dev@arrow.apache.org
> > > > Subject: Re: Question about mutability
> > > >
> > > > I don't think one application/process's memory space will be made
> > > > available to other applications/processes. It's fundamentally
> > > > hard for processes to share their address spaces.
> > > >
> > > > IIUC, with Arrow, when application A shares data with application
> > > > B, the data is still duplicated in the memory spaces of A and B.
> > > > It's just that data serialization/deserialization are much faster
> > > > with Arrow (compared with Protobuf).
> > > >
> > > > On Wed, Feb 24, 2016 at 10:40 AM Corey Nolet <cjno...@gmail.com>
> > > > wrote:
> > > >
> > > > > Forgive me if this question seems ill-informed. I just started
> > > > > looking at Arrow yesterday. I looked around the github a tad.
> > > > >
> > > > > Are you expecting the memory space held by one application to
> > > > > be mutable by that application and made available to all
> > > > > applications trying to read the memory space?

--
Cheers, Leif
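
For what it's worth, the hand-off in steps 2 and 3 at the top of this message (one side fills a buffer, the other side receives only a handle and computes over the data in place) can be sketched with nothing but the Python standard library. This is not Arrow or Spark code. It assumes a single float64 column, uses `multiprocessing.shared_memory` as a stand-in for whatever sharing mechanism Arrow ends up with, and the names `write_column`, `summarize`, and `demo` are made up for illustration.

```python
import struct
from multiprocessing import shared_memory


def write_column(values):
    """Producer (stand-in for step 1): lay out float64 values in shared memory."""
    shm = shared_memory.SharedMemory(create=True, size=8 * len(values))
    # Pack the values as native 8-byte doubles directly into the shared block.
    struct.pack_into(f"{len(values)}d", shm.buf, 0, *values)
    return shm


def summarize(handle, count):
    """Consumer (stand-in for steps 2-3): attach by name and read in place."""
    shm = shared_memory.SharedMemory(name=handle)
    try:
        # The block may be larger than requested, so slice to the real length,
        # then reinterpret the bytes as doubles. Neither step copies the data.
        with shm.buf[: 8 * count] as raw, raw.cast("d") as column:
            return sum(column), max(column)
    finally:
        shm.close()


def demo():
    """Run producer and consumer back to back; only the name is passed along."""
    shm = write_column([1.0, 2.0, 3.5])
    try:
        return summarize(shm.name, 3)
    finally:
        shm.close()
        shm.unlink()  # the creator is responsible for freeing the block
```

Here both sides run in one process for brevity; the same `summarize` call would work from a second process that is given only `shm.name` and the element count. The point is the one from the step list: the consumer gets a handle, not a serialized copy.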