hi Leif -- you've articulated almost exactly my vision for pandas
interoperability with Spark via Arrow. There are some questions to sort
out, like shared memory / mmap management and security / sandboxing
questions, but in general moving toward a model where RPC's contain shared
memory offsets to Arrow data rather than shipping large chunks of
serialized data through stdin/stdout would be a huge leap forward.

On Wed, Feb 24, 2016 at 4:26 PM, Leif Walsh <leif.wa...@gmail.com> wrote:

> Here's the image that popped into my mind when I heard about this project,
> at best it's a motivating example, at worst it's a distraction:
>
> 1. Spark reads parquet from wherever into an arrow structure in shared
> memory.
> 2. Spark executor calls into the Python half of pyspark with a handle to
> this memory.
> 3. Pandas computes some summary over this data in place, produces some
> smaller result set, also in memory, also in arrow format.
> 4. Spark driver collects results (without ser/de overheads, just straight
> byte buffer reads across the network), concatenates them in memory.
> 5. Spark driver hands that memory to local pandas, which does more
> computation over that in place.
>
> This is what got me excited, it kills a lot of the spark overheads related
> to data (and for a kicker, also enables full pandas compatibility along
> with statsmodels/scikit/etc in step 3). I suppose if this is not a vision
> the arrow secs have, I'd be interested in how. Not looking for any promises
> yet though.
> On Wed, Feb 24, 2016 at 14:52 Corey Nolet <cjno...@apache.org> wrote:
>
> > So far, how does the integration with the Spark project look? Do you
> > envision cached Spark partitions allocating Arrows? I could imagine this
> > would be absolutely huge for being able to ask questions of real-time
> data
> > sets across applications.
> >
> > On Wed, Feb 24, 2016 at 2:49 PM, Zhe Zhang <z...@apache.org> wrote:
> >
> > > Thanks for the insights Jacques. Interesting to learn the thoughts on
> > > zero-copy sharing.
> > >
> > > mmap allows sharing address spaces via filesystem interface. That has
> > some
> > > security concerns but with the immutability restrictions (as you
> > clarified
> > > here), it sounds feasible. What other options do you have in mind?
> > >
> > > On Wed, Feb 24, 2016 at 11:30 AM Corey Nolet <cjno...@gmail.com>
> wrote:
> > >
> > > > Agreed,
> > > >
> > > > I thought the whole purpose was to share the memory space (using
> > possibly
> > > > unsafe operations like ByteBuffers) so that it could be directly
> shared
> > > > without copy. My interest in this is to have it enable fully
> in-memory
> > > > computation. Not just "processing" as in Spark, but as a fully
> > in-memory
> > > > datastore that one application can expose and share with others (e.g.
> > In
> > > > memory structure is constructed from a series of parquet files,
> > somehow,
> > > > then Spark pulls it in, does some computations, exposes a data set,
> > > > etc...).
> > > >
> > > > If you are leaving the allocation of the memory to the applications
> and
> > > > underneath the memory is being allocated using direct bytebuffers, I
> > > can't
> > > > see exactly why the problem is fundamentally hard- especially if the
> > > > applications themselves are worried about exposing their own memory
> > > spaces.
> > > >
> > > >
> > > > On Wed, Feb 24, 2016 at 2:17 PM, Andrew Brust <
> > > > andrew.br...@bluebadgeinsights.com> wrote:
> > > >
> > > > > Hmm...that's not exactly how Jaques described things to me when he
> > > > briefed
> > > > > me on Arrow ahead of the announcement.
> > > > >
> > > > > -----Original Message-----
> > > > > From: Zhe Zhang [mailto:z...@apache.org]
> > > > > Sent: Wednesday, February 24, 2016 2:08 PM
> > > > > To: dev@arrow.apache.org
> > > > > Subject: Re: Question about mutability
> > > > >
> > > > > I don't think one application/process's memory space will be made
> > > > > available to other applications/processes. It's fundamentally hard
> > for
> > > > > processes to share their address spaces.
> > > > >
> > > > > IIUC, with Arrow, when application A shares data with application
> B,
> > > the
> > > > > data is still duplicated in the memory spaces of A and B. It's just
> > > that
> > > > > data serialization/deserialization are much faster with Arrow
> > (compared
> > > > > with Protobuf).
> > > > >
> > > > > On Wed, Feb 24, 2016 at 10:40 AM Corey Nolet <cjno...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Forgive me if this question seems ill-informed. I just started
> > > looking
> > > > > > at Arrow yesterday. I looked around the github a tad.
> > > > > >
> > > > > > Are you expecting the memory space held by one application to be
> > > > > > mutable by that application and made available to all
> applications
> > > > > > trying to read the memory space?
> > > > > >
> > > > >
> > > >
> > >
> >
> --
> --
> Cheers,
> Leif
>

Reply via email to