hi Leif -- you've articulated almost exactly my vision for pandas interoperability with Spark via Arrow. There are some questions to sort out, like shared memory / mmap management and security / sandboxing questions, but in general moving toward a model where RPC's contain shared memory offsets to Arrow data rather than shipping large chunks of serialized data through stdin/stdout would be a huge leap forward.
On Wed, Feb 24, 2016 at 4:26 PM, Leif Walsh <leif.wa...@gmail.com> wrote: > Here's the image that popped into my mind when I heard about this project, > at best it's a motivating example, at worst it's a distraction: > > 1. Spark reads parquet from wherever into an arrow structure in shared > memory. > 2. Spark executor calls into the Python half of pyspark with a handle to > this memory. > 3. Pandas computes some summary over this data in place, produces some > smaller result set, also in memory, also in arrow format. > 4. Spark driver collects results (without ser/de overheads, just straight > byte buffer reads across the network), concatenates them in memory. > 5. Spark driver hands that memory to local pandas, which does more > computation over that in place. > > This is what got me excited, it kills a lot of the spark overheads related > to data (and for a kicker, also enables full pandas compatibility along > with statsmodels/scikit/etc in step 3). I suppose if this is not a vision > the arrow secs have, I'd be interested in how. Not looking for any promises > yet though. > On Wed, Feb 24, 2016 at 14:52 Corey Nolet <cjno...@apache.org> wrote: > > > So far, how does the integration with the Spark project look? Do you > > envision cached Spark partitions allocating Arrows? I could imagine this > > would be absolutely huge for being able to ask questions of real-time > data > > sets across applications. > > > > On Wed, Feb 24, 2016 at 2:49 PM, Zhe Zhang <z...@apache.org> wrote: > > > > > Thanks for the insights Jacques. Interesting to learn the thoughts on > > > zero-copy sharing. > > > > > > mmap allows sharing address spaces via filesystem interface. That has > > some > > > security concerns but with the immutability restrictions (as you > > clarified > > > here), it sounds feasible. What other options do you have in mind? > > > > > > On Wed, Feb 24, 2016 at 11:30 AM Corey Nolet <cjno...@gmail.com> > wrote: > > > > > > > Agreed, > > > > > > > > I thought the whole purpose was to share the memory space (using > > possibly > > > > unsafe operations like ByteBuffers) so that it could be directly > shared > > > > without copy. My interest in this is to have it enable fully > in-memory > > > > computation. Not just "processing" as in Spark, but as a fully > > in-memory > > > > datastore that one application can expose and share with others (e.g. > > In > > > > memory structure is constructed from a series of parquet files, > > somehow, > > > > then Spark pulls it in, does some computations, exposes a data set, > > > > etc...). > > > > > > > > If you are leaving the allocation of the memory to the applications > and > > > > underneath the memory is being allocated using direct bytebuffers, I > > > can't > > > > see exactly why the problem is fundamentally hard- especially if the > > > > applications themselves are worried about exposing their own memory > > > spaces. > > > > > > > > > > > > On Wed, Feb 24, 2016 at 2:17 PM, Andrew Brust < > > > > andrew.br...@bluebadgeinsights.com> wrote: > > > > > > > > > Hmm...that's not exactly how Jaques described things to me when he > > > > briefed > > > > > me on Arrow ahead of the announcement. > > > > > > > > > > -----Original Message----- > > > > > From: Zhe Zhang [mailto:z...@apache.org] > > > > > Sent: Wednesday, February 24, 2016 2:08 PM > > > > > To: dev@arrow.apache.org > > > > > Subject: Re: Question about mutability > > > > > > > > > > I don't think one application/process's memory space will be made > > > > > available to other applications/processes. It's fundamentally hard > > for > > > > > processes to share their address spaces. > > > > > > > > > > IIUC, with Arrow, when application A shares data with application > B, > > > the > > > > > data is still duplicated in the memory spaces of A and B. It's just > > > that > > > > > data serialization/deserialization are much faster with Arrow > > (compared > > > > > with Protobuf). > > > > > > > > > > On Wed, Feb 24, 2016 at 10:40 AM Corey Nolet <cjno...@gmail.com> > > > wrote: > > > > > > > > > > > Forgive me if this question seems ill-informed. I just started > > > looking > > > > > > at Arrow yesterday. I looked around the github a tad. > > > > > > > > > > > > Are you expecting the memory space held by one application to be > > > > > > mutable by that application and made available to all > applications > > > > > > trying to read the memory space? > > > > > > > > > > > > > > > > > > > > > -- > -- > Cheers, > Leif >