On 25 February 2016 at 11:57, Todd Lipcon <t...@cloudera.com> wrote:

> On Thu, Feb 25, 2016 at 11:48 AM, Henry Robinson <he...@cloudera.com>
> wrote:
> > It seems like Arrow would benefit from a complementary effort to
> > define a (simple) streaming memory transfer protocol between
> > processes. Although Wes mentioned RPC earlier, I'd hope that's really
> > a placeholder for "fast inter-process message passing", since the
> > overheads of RPC as usually realised would likely be very
> > significant, even just for control messages.
>
> I'm a little skeptical of this as being completely generalizable, at
> least in the context of storage systems. For example, in Kudu, our
> scanner responses aren't just tabular data, but also contain various
> bits of control info that are very Kudu-specific. Our scanner requests
> are super Kudu-specific, with stuff like predicate pushdown, tablet
> IDs, etc.
The way I'm thinking about it is that someone upstream makes a
Kudu-specific request, but as part of that request provides a descriptor
of a shared ring buffer. Reading Arrow batches from and writing them to
that buffer is part of a simple, standard protocol. I agree there's
little justification for a general "request data from data source"
protocol. Or does Kudu have very specific stuff in the "get next batch"
API as well?

> For the use case of external UDFs or UDTFs, I think there's good
> opportunity to standardize, though - that interface is a much more
> "pure functional interface" where we can reasonably expect everyone to
> conform to the same protocol.
>
> Regarding RPC overhead, so long as the batches are a reasonable size,
> I think it could be well amortized. Kudu's RPC system can do 300k
> RPCs/sec or more on localhost, so if each one of those RPCs
> corresponds to an 8MB data buffer, that's 2.4TB/sec of bandwidth
> (about 100x more than RAM bandwidth). Or, put another way, with 8MB
> batches, assuming 24GB/sec RAM bandwidth, you only need 3k round
> trips/second to saturate the bus. Even a super slow RPC control plane
> should be able to handle that in its sleep.

Fair enough. What about the latency of those RPCs?

> > This may just mean a common way to access a ring buffer with a
> > defined in-memory layout and a way of signalling "buffer no longer
> > empty". Without doing this in a standardised way, we'd run the risk
> > of N^2 communication protocols between N different analytics
> > systems.
> >
> > Such a thing could also potentially be used for RDMA-based
> > signalling that avoids the CPU, as per this paper from MSR:
> > https://www.usenix.org/conference/nsdi14/technical-sessions/dragojevic
>
> Agreed that RDMA is a very interesting thing to consider, especially
> as next-gen networking systems start to become more common. I'll check
> out that paper.

It's good!
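As an aside, Todd's back-of-envelope amortization numbers above are easy to reproduce. A quick sketch in decimal units, where the 24GB/sec RAM bandwidth is his stated assumption, not a measured figure:

```python
# Back-of-envelope check of the RPC amortization numbers above.
batch_bytes = 8 * 10**6        # 8 MB per Arrow batch
rpcs_per_sec = 300_000         # localhost RPC rate quoted for Kudu
ram_bandwidth = 24 * 10**9     # assumed RAM bandwidth: 24 GB/sec

# If every RPC carried a full batch, the implied transfer rate:
print(batch_bytes * rpcs_per_sec / 10**12, "TB/sec")   # 2.4 TB/sec

# Round trips per second needed just to saturate the memory bus:
print(ram_bandwidth / batch_bytes, "round trips/sec")  # 3000.0
```

In other words, the control plane only needs to sustain ~1% of the RPC rate Kudu already demonstrates before memory bandwidth, not RPC rate, becomes the bottleneck.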
The follow-on paper in SOSP (
http://sigops.org/sosp/sosp15/current/2015-Monterey/printable/227-dragojevic.pdf
) is also good, but it's less relevant to this question.

> -Todd
>
>> Another difficulty is security -- with a format like Arrow that
>> embeds internal pointers, any process following the pointers needs
>> to either trust the other processes with write access to the
>> segment, or needs to carefully verify any pointers before following
>> them. Otherwise, it's pretty easy to crash the reader if you're a
>> malicious writer. In the case of using Arrow from a shared daemon
>> (e.g. impalad) to access an untrusted UDF (e.g. Python running in a
>> setuid task container), this is a real concern IMO.
>>
>> -Todd
>>
>> On Thu, Feb 25, 2016 at 8:37 AM, Wes McKinney <w...@cloudera.com> wrote:
>> > hi Leif -- you've articulated almost exactly my vision for pandas
>> > interoperability with Spark via Arrow. There are some questions to
>> > sort out, like shared memory / mmap management and security /
>> > sandboxing questions, but in general moving toward a model where
>> > RPCs contain shared memory offsets to Arrow data rather than
>> > shipping large chunks of serialized data through stdin/stdout
>> > would be a huge leap forward.
>> >
>> > On Wed, Feb 24, 2016 at 4:26 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
>> >
>> >> Here's the image that popped into my mind when I heard about this
>> >> project; at best it's a motivating example, at worst it's a
>> >> distraction:
>> >>
>> >> 1. Spark reads parquet from wherever into an Arrow structure in
>> >> shared memory.
>> >> 2. Spark executor calls into the Python half of pyspark with a
>> >> handle to this memory.
>> >> 3. Pandas computes some summary over this data in place, produces
>> >> some smaller result set, also in memory, also in Arrow format.
>> >> 4.
>> >> Spark driver collects results (without ser/de overheads, just
>> >> straight byte buffer reads across the network), concatenates them
>> >> in memory.
>> >> 5. Spark driver hands that memory to local pandas, which does
>> >> more computation over that in place.
>> >>
>> >> This is what got me excited: it kills a lot of the Spark
>> >> overheads related to data (and, for a kicker, also enables full
>> >> pandas compatibility along with statsmodels/scikit/etc in step
>> >> 3). I suppose if this is not a vision the Arrow devs have, I'd be
>> >> interested in why not. Not looking for any promises yet though.
>> >>
>> >> On Wed, Feb 24, 2016 at 14:52 Corey Nolet <cjno...@apache.org> wrote:
>> >>
>> >> > So far, how does the integration with the Spark project look?
>> >> > Do you envision cached Spark partitions allocating Arrows? I
>> >> > could imagine this would be absolutely huge for being able to
>> >> > ask questions of real-time data sets across applications.
>> >> >
>> >> > On Wed, Feb 24, 2016 at 2:49 PM, Zhe Zhang <z...@apache.org> wrote:
>> >> >
>> >> > > Thanks for the insights, Jacques. Interesting to learn the
>> >> > > thoughts on zero-copy sharing.
>> >> > >
>> >> > > mmap allows sharing address spaces via the filesystem
>> >> > > interface. That has some security concerns, but with the
>> >> > > immutability restrictions (as you clarified here), it sounds
>> >> > > feasible. What other options do you have in mind?
>> >> > >
>> >> > > On Wed, Feb 24, 2016 at 11:30 AM Corey Nolet <cjno...@gmail.com> wrote:
>> >> > >
>> >> > > > Agreed,
>> >> > > >
>> >> > > > I thought the whole purpose was to share the memory space
>> >> > > > (using possibly unsafe operations like ByteBuffers) so that
>> >> > > > it could be directly shared without copy.
>> >> > > > My interest in this is to have it enable fully in-memory
>> >> > > > computation. Not just "processing" as in Spark, but a fully
>> >> > > > in-memory datastore that one application can expose and
>> >> > > > share with others (e.g. an in-memory structure is
>> >> > > > constructed from a series of parquet files, somehow, then
>> >> > > > Spark pulls it in, does some computations, exposes a data
>> >> > > > set, etc.).
>> >> > > >
>> >> > > > If you are leaving the allocation of the memory to the
>> >> > > > applications, and underneath the memory is being allocated
>> >> > > > using direct ByteBuffers, I can't see exactly why the
>> >> > > > problem is fundamentally hard -- especially if the
>> >> > > > applications themselves are worried about exposing their
>> >> > > > own memory spaces.
>> >> > > >
>> >> > > > On Wed, Feb 24, 2016 at 2:17 PM, Andrew Brust <
>> >> > > > andrew.br...@bluebadgeinsights.com> wrote:
>> >> > > >
>> >> > > > > Hmm... that's not exactly how Jacques described things to
>> >> > > > > me when he briefed me on Arrow ahead of the announcement.
>> >> > > > >
>> >> > > > > -----Original Message-----
>> >> > > > > From: Zhe Zhang [mailto:z...@apache.org]
>> >> > > > > Sent: Wednesday, February 24, 2016 2:08 PM
>> >> > > > > To: dev@arrow.apache.org
>> >> > > > > Subject: Re: Question about mutability
>> >> > > > >
>> >> > > > > I don't think one application/process's memory space will
>> >> > > > > be made available to other applications/processes. It's
>> >> > > > > fundamentally hard for processes to share their address
>> >> > > > > spaces.
>> >> > > > >
>> >> > > > > IIUC, with Arrow, when application A shares data with
>> >> > > > > application B, the data is still duplicated in the memory
>> >> > > > > spaces of A and B.
>> >> > > > > It's just that data serialization/deserialization are
>> >> > > > > much faster with Arrow (compared with Protobuf).
>> >> > > > >
>> >> > > > > On Wed, Feb 24, 2016 at 10:40 AM Corey Nolet <cjno...@gmail.com> wrote:
>> >> > > > >
>> >> > > > > > Forgive me if this question seems ill-informed. I just
>> >> > > > > > started looking at Arrow yesterday. I looked around the
>> >> > > > > > GitHub a tad.
>> >> > > > > >
>> >> > > > > > Are you expecting the memory space held by one
>> >> > > > > > application to be mutable by that application and made
>> >> > > > > > available to all applications trying to read the memory
>> >> > > > > > space?
>> >>
>> >> --
>> >> Cheers,
>> >> Leif
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera

--
Henry Robinson
Software Engineer
Cloudera
415-994-6679
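The "descriptor of a shared ring buffer" hand-off discussed in the thread can be made concrete with a short sketch. This is not Arrow code, just an illustration: a single-producer/single-consumer ring over a named shared memory segment, where the data-source request would carry only the segment name. All names, slot sizes, and layout here are invented for the example. A real implementation would need atomic/fenced counter updates, an actual "buffer no longer empty" signal (semaphore, eventfd, or futex) instead of polling, and variable-sized Arrow batches rather than fixed slots:

```python
import struct
from multiprocessing import shared_memory

SLOT_SIZE = 1024   # fixed-size slots keep the sketch simple
NUM_SLOTS = 8
HEADER = 16        # two little-endian uint64 counters: writes, reads

class RingBuffer:
    """Minimal SPSC ring over a shared memory segment (illustrative only)."""

    def __init__(self, name=None):
        size = HEADER + SLOT_SIZE * NUM_SLOTS
        if name is None:
            # Producer side: create the segment; pass shm.name in the request.
            self.shm = shared_memory.SharedMemory(create=True, size=size)
        else:
            # Consumer side: attach to the segment named in the request.
            self.shm = shared_memory.SharedMemory(name=name)
        self.buf = self.shm.buf

    def _counters(self):
        return struct.unpack_from("<QQ", self.buf, 0)

    def put(self, payload):
        """Copy one batch into the next free slot; False if the ring is full."""
        writes, reads = self._counters()
        if writes - reads == NUM_SLOTS:
            return False
        off = HEADER + (writes % NUM_SLOTS) * SLOT_SIZE
        struct.pack_into("<I", self.buf, off, len(payload))
        self.buf[off + 4:off + 4 + len(payload)] = payload
        struct.pack_into("<Q", self.buf, 0, writes + 1)  # publish the slot
        return True

    def get(self):
        """Return the oldest unread batch, or None if the ring is empty."""
        writes, reads = self._counters()
        if writes == reads:
            return None
        off = HEADER + (reads % NUM_SLOTS) * SLOT_SIZE
        (n,) = struct.unpack_from("<I", self.buf, off)
        payload = bytes(self.buf[off + 4:off + 4 + n])
        struct.pack_into("<Q", self.buf, 8, reads + 1)   # retire the slot
        return payload

    def close(self, unlink=False):
        self.shm.close()
        if unlink:
            self.shm.unlink()
```

A consumer in another process would attach with `RingBuffer(name=...)` using the name carried in the (Kudu-specific or otherwise) request, then drain batches with `get()` -- the control-plane RPC and the data-plane buffer stay decoupled, which is the point Henry was making.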
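Todd's security point -- that a reader of a shared segment must verify embedded pointers before following them -- also admits a small sketch. Assuming references are carried as (offset, length) pairs relative to the segment base (a hypothetical layout for illustration, not Arrow's actual one), an untrusting reader bounds-checks every reference before dereferencing:

```python
def checked_view(buf, offset, length):
    """Return buf[offset:offset+length], refusing anything outside the
    segment. A hostile writer can plant arbitrary offsets in shared
    memory, so the reader must validate before following them."""
    if offset < 0 or length < 0:
        raise ValueError("negative offset or length")
    end = offset + length  # Python ints never overflow; in C/C++ this
                           # addition itself needs an overflow check
    if end > len(buf):
        raise ValueError("region extends past end of segment")
    return buf[offset:end]
```

With checks like this in every accessor, a malicious writer can at worst make the reader raise an error, rather than crash it by steering reads outside the segment.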