This is all very interesting stuff, but just so we’re clear: it is not Arrow’s
responsibility to provide an RPC/IPC/LPC mechanism, nor facilities for resource
management. If we DID decide to make this Arrow’s responsibility, it would
overlap with other components that specialize in exactly these things.



> On Mar 16, 2016, at 9:49 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> 
> @Todd: agree entirely on prototyping design. My goal is to throw out some
> ideas and some POC code, and then we can explore from there.
> 
> My initial thinking has mainly been around lifecycle management. I've done
> some work previously where a consistently sized shared buffer backed by mmap
> improved performance. This is more complicated here, given the requirements
> for collaborative allocation and cross-process reference counts.
> 
> With regard to whether this is more generally applicable: I think it could
> ultimately be more general, but I suggest we initially focus on the particular
> application of moving long-lived Arrow record batches between a producer
> and a consumer. Constraining the problem should get us to something workable
> sooner. We can abstract to a more general solution as other clear requirements
> emerge.
> 
> With regard to Cap'n Proto, I believe that when they talk about zero-copy
> shared memory they are simply saying the structure supports it (the same as
> any memory-layout-based design). I don't believe they have actually
> implemented a protocol and multi-language implementation for zero-copy
> cross-process communication.
> 
> One other note to make here is that my goal is not just about performance
> but also about memory footprint: a shared memory protocol would allow
> multiple tools to interact with the same hot dataset.
> 
> RE: ACLs, for the initial focus I suggest that we consider the two sharing
> processes to be "trusted" and expect the initial Arrow API reference
> implementations to manage memory access.
> 
> Regarding other questions that Todd threw out:
> 
> - if you are using an mmapped file in /dev/shm/, how do you make sure it
> gets cleaned up if the process crashes?
> 
>>> Agreed that it needs to get resolved. If I recall correctly, deletion can
> be requested once the associated processes have attached to the memory, which
> lets the kernel reclaim it once all attaching processes have exited. If this
> isn't enough, then we may very well need a simple external coordinator.
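> 
> To make that concrete, here is a minimal sketch of the map-then-unlink
> pattern (Linux-only, hypothetical segment name, and it assumes every
> participant has opened the file before it is deleted):
> 
>   import java.io.File;
>   import java.io.RandomAccessFile;
>   import java.nio.MappedByteBuffer;
>   import java.nio.channels.FileChannel;
> 
>   public class ShmScratch {
>       public static void main(String[] args) throws Exception {
>           // Hypothetical segment name -- not part of any Arrow spec.
>           File f = new File("/dev/shm/arrow-poc-segment");
>           try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
>                FileChannel ch = raf.getChannel()) {
>               MappedByteBuffer buf =
>                       ch.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 20);
>               // Deleting the file after all participants have mapped it means
>               // the kernel reclaims the pages when the last mapping goes away,
>               // even if a process crashes without cleaning up after itself.
>               f.delete();
>               buf.putInt(0, 42); // the existing mapping stays valid after unlink
>           }
>       }
>   }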
> 
> - how do you allocate memory to it? there's nothing ensuring that /dev/shm
> doesn't swap out if you try to put too much in there, and then your
> in-memory super-fast access will basically collapse under swap thrashing
> 
>>> The simplest model initially is probably one where we assume a master and a
> slave (ideally negotiated on the initial connection). The master is
> responsible for allocating memory and handing it to the slave, and for
> enforcing reasonable memory allocation limits just like any other allocator.
> Slaves that need to allocate memory must ask the master (at whatever chunk
> size makes sense) and will get rejected if they are too aggressive. (This
> probably means that at any point an IPC can fall back to RPC?)
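> 
> A rough sketch of what that contract between the two processes might look
> like (hypothetical names, not an existing Arrow API):
> 
>   // Hypothetical master/slave allocation protocol: a slave requests chunks
>   // from the master and may be refused if it exceeds its limit.
>   interface SharedAllocator {
>       SharedRegion allocate(long numBytes) throws AllocationRejectedException;
>       void release(SharedRegion region);
>   }
> 
>   interface SharedRegion {
>       long offset();   // offset of the chunk within the shared mmap'd segment
>       long length();   // size of the chunk in bytes
>   }
> 
>   class AllocationRejectedException extends Exception {
>       // A rejected slave could fall back to copying its data over RPC.
>       AllocationRejectedException(String message) { super(message); }
>   }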
> 
> - how do you do lifecycle management across the two processes? If, say,
> Kudu wants to pass a block of data to some Python program, how does it know
> when the Python program is done reading it and it should be deleted? What
> if the python program crashed in the middle - when can Kudu release it?
> 
>>> My thinking, as mentioned earlier, is a shared reference-count model for
> complex situations, and possibly a "request/response" ownership model for
> simpler cases.
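> 
> As a sketch of the reference-count idea, the counter could live in the shared
> mapping itself and be updated atomically from both processes. The class below
> is hypothetical and leans on sun.misc.Unsafe (a JDK-internal API the Netty
> allocator already uses); nothing like it exists in Arrow today:
> 
>   import java.lang.reflect.Field;
>   import java.nio.MappedByteBuffer;
>   import sun.misc.Unsafe;
> 
>   // Keeps a reference count in the first 4 bytes of a shared mapping so the
>   // producer and consumer processes agree on when a batch is reclaimable.
>   public class SharedRefCount {
>       private static final Unsafe UNSAFE = loadUnsafe();
>       private final long addr;
> 
>       public SharedRefCount(MappedByteBuffer shared) {
>           // A MappedByteBuffer is a DirectBuffer underneath, so it exposes
>           // the raw address of the mapping.
>           this.addr = ((sun.nio.ch.DirectBuffer) shared).address();
>       }
> 
>       public void retain() {
>           UNSAFE.getAndAddInt(null, addr, 1);
>       }
> 
>       public boolean release() {
>           // True when this call dropped the count to zero: the last reader
>           // is done and the producing process may free or reuse the region.
>           return UNSAFE.getAndAddInt(null, addr, -1) == 1;
>       }
> 
>       private static Unsafe loadUnsafe() {
>           try {
>               Field f = Unsafe.class.getDeclaredField("theUnsafe");
>               f.setAccessible(true);
>               return (Unsafe) f.get(null);
>           } catch (ReflectiveOperationException e) {
>               throw new RuntimeException(e);
>           }
>       }
>   }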
> 
> - how do you do security? If both sides of the connection don't trust each
> other, and use length prefixes and offsets, you have to be constantly
> validating and re-validating everything you read.
> 
> I'm suggesting that we start by assuming trust so we don't get too wrapped up
> in all the extra complexities of security. My experience with these things
> is that a lot of users will pick performance or footprint over security for
> quite some time. For example, if I recall correctly, with the shared file
> descriptor model that was initially implemented in the HDFS client, people
> used short-circuit reads for years before security was correctly
> implemented. (Am I remembering this right?)
> 
> Lastly, as I mentioned above, I don't think there should be any requirement
> that Arrow communication be limited to only 'IPC'. As Todd points out, in
> many cases unix domain sockets will be just fine.
> 
> We need to implement both models because we all know that locality will
> never be guaranteed. The IPC design/implementation needs to be good for
> anything to make it into Arrow.
> 
> thanks
> Jacques
> 
> 
> 
> On Wed, Mar 16, 2016 at 8:54 AM, Zhe Zhang <z...@apache.org> wrote:
> 
>> I have similar concerns as Todd stated below. With an mmap-based approach,
>> we are treating shared memory objects like files. This brings in all
>> filesystem-related considerations like ACLs and lifecycle management.
>> 
>> Stepping back a little, the shared-memory work isn't really specific to
>> Arrow. A few questions related to this:
>> 1) Has the topic been discussed in the context of protobuf (or other IPC
>> protocols) before? It seems Cap'n Proto (https://capnproto.org/) has
>> zero-copy shared memory. I haven't read the implementation details, though.
>> 2) If the shared-memory work benefits a wide range of protocols, should it
>> be a generalized, standalone library?
>> 
>> Thanks,
>> Zhe
>> 
>> On Tue, Mar 15, 2016 at 8:30 PM Todd Lipcon <t...@cloudera.com> wrote:
>> 
>>> Having thought about this quite a bit in the past, I think the mechanics of
>>> how to share memory are by far the easiest part. The much harder part is
>>> the resource management and ownership. Questions like:
>>> 
>>> - if you are using an mmapped file in /dev/shm/, how do you make sure it
>>> gets cleaned up if the process crashes?
>>> - how do you allocate memory to it? there's nothing ensuring that /dev/shm
>>> doesn't swap out if you try to put too much in there, and then your
>>> in-memory super-fast access will basically collapse under swap thrashing
>>> - how do you do lifecycle management across the two processes? If, say,
>>> Kudu wants to pass a block of data to some Python program, how does it know
>>> when the Python program is done reading it and it should be deleted? What
>>> if the python program crashed in the middle - when can Kudu release it?
>>> - how do you do security? If both sides of the connection don't trust each
>>> other, and use length prefixes and offsets, you have to be constantly
>>> validating and re-validating everything you read.
>>> 
>>> Another big factor is that shared memory is not, in my experience,
>>> immediately faster than just copying data over a unix domain socket. In
>>> particular, the first time you read an mmapped file, you'll end up paying
>>> minor page fault overhead on every page. This can be improved with
>>> HugePages, but huge page mmaps are not supported yet in current Linux (work
>>> going on currently to address this). So you're left with hugetlbfs, which
>>> involves static allocations and much more pain.
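>>> 
>>> One partial mitigation, short of huge pages, is to pre-fault the mapping
>>> once so the minor faults are not paid on the hot read path. A sketch in
>>> Java (hypothetical segment name):
>>> 
>>>   import java.io.RandomAccessFile;
>>>   import java.nio.MappedByteBuffer;
>>>   import java.nio.channels.FileChannel;
>>> 
>>>   public class Prefault {
>>>       public static void main(String[] args) throws Exception {
>>>           try (RandomAccessFile raf =
>>>                        new RandomAccessFile("/dev/shm/arrow-poc-segment", "r");
>>>                FileChannel ch = raf.getChannel()) {
>>>               MappedByteBuffer buf =
>>>                       ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
>>>               // load() is a best-effort hint that touches every page, so
>>>               // the faults happen here instead of on the first real scan.
>>>               buf.load();
>>>           }
>>>       }
>>>   }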
>>> 
>>> All the above is a long way to say: let's make sure we do the right
>>> prototyping and up-front design before jumping into code.
>>> 
>>> -Todd
>>> 
>>> 
>>> 
>>> On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <jacq...@apache.org>
>>> wrote:
>>> 
>>>> @Corey
>>>> The POC Steven and Wes are working on is based on MappedByteBuffer, but I'm
>>>> looking at using Netty's fork of tcnative to use shared memory directly.
>>>> 
>>>> @Yiannis
>>>> We need to have both RPC and shared memory mechanisms (the latter is what
>>>> I'm inclined to call IPC, although it is really a specific kind of IPC).
>>>> The idea is that we negotiate via RPC, and then, if we determine shared
>>>> locality, we work over shared memory (preferably for both data and
>>>> control). So the system interacting with HBase in your example would be
>>>> the one responsible for placing collocated execution to take advantage of
>>>> IPC.
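>>>> 
>>>> A very rough sketch of that negotiation step (hypothetical types; the real
>>>> handshake would be defined by whatever RPC layer we settle on):
>>>> 
>>>>   // Each side advertises a machine identity in the RPC hello message;
>>>>   // only if the identities match do the peers switch to shared memory.
>>>>   enum Transport { SHARED_MEMORY, SOCKET_RPC }
>>>> 
>>>>   final class Handshake {
>>>>       static Transport choose(String localMachineId, String remoteMachineId) {
>>>>           return localMachineId.equals(remoteMachineId)
>>>>                   ? Transport.SHARED_MEMORY   // collocated: map the same segment
>>>>                   : Transport.SOCKET_RPC;     // remote: stream batches over a socket
>>>>       }
>>>>   }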
>>>> 
>>>> How do others feel about my redefinition of IPC to mean same-memory-space
>>>> communication (either via shared memory or RDMA) versus RPC as
>>>> socket-based communication?
>>>> 
>>>> 
>>>> On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>>> 
>>>>> I was seeing Netty's unsafe classes being used here, not mapped byte
>>>>> buffers. Not sure if that statement is completely correct, but I'll have
>>>>> to dig through the code again to figure that out.
>>>>> 
>>>>> The more I was looking at Unsafe, the more it made sense why that would
>>>>> be used. Apparently it's also supposed to be included in Java 9 as a
>>>>> first-class API.
>>>>> On Mar 15, 2016 7:03 PM, "Wes McKinney" <w...@cloudera.com> wrote:
>>>>> 
>>>>>> My understanding is that you can use java.nio.MappedByteBuffer to work
>>>>>> with memory-mapped files as one way to share memory pages between Java
>>>>>> (and non-Java) processes without copying.
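>>>>>> 
>>>>>> For example, the consuming side could be as small as the sketch below
>>>>>> (hypothetical file name; some producer is assumed to have created and
>>>>>> populated the segment already):
>>>>>> 
>>>>>>   import java.io.RandomAccessFile;
>>>>>>   import java.nio.MappedByteBuffer;
>>>>>>   import java.nio.channels.FileChannel;
>>>>>> 
>>>>>>   public class Consumer {
>>>>>>       public static void main(String[] args) throws Exception {
>>>>>>           // Map the segment the producer created; no bytes are copied,
>>>>>>           // both processes see the same physical pages.
>>>>>>           try (RandomAccessFile raf =
>>>>>>                        new RandomAccessFile("/dev/shm/arrow-poc-segment", "r");
>>>>>>                FileChannel ch = raf.getChannel()) {
>>>>>>               MappedByteBuffer batch =
>>>>>>                       ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
>>>>>>               System.out.println("first int: " + batch.getInt(0));
>>>>>>           }
>>>>>>       }
>>>>>>   }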
>>>>>> 
>>>>>> I am hoping that we can reach a POC of zero-copy Arrow memory sharing
>>>>>> Java-to-Java and Java-to-C++ in the near future. Indeed this will have
>>>>>> huge implications once we get it working end to end (for example,
>>>>>> receiving memory from a Java process in Python without a heavy ser-de
>>>>>> step -- it's what we've always dreamed of) and with the metadata and
>>>>>> shared memory control flow standardized.
>>>>>> 
>>>>>> - Wes
>>>>>> 
>>>>>> On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cjno...@gmail.com> wrote:
>>>>>>> If I understand correctly, Arrow is using Netty underneath, which is
>>>>>>> using Sun's Unsafe API in order to allocate direct byte buffers off
>>>>>>> heap. It is using Netty to communicate, between "client" and "server",
>>>>>>> information about memory addresses for the data being requested.
>>>>>>> 
>>>>>>> I've never attempted to use the Unsafe API to access off-heap memory
>>>>>>> that has been allocated in one JVM from another JVM, but I'm assuming
>>>>>>> this must be the case in order to claim that the memory is being
>>>>>>> accessed directly without being copied, correct?
>>>>>>> 
>>>>>>> The implication here is huge. If the memory is being directly shared
>>>>>>> across processes by allowing them to reach directly into the direct
>>>>>>> byte buffers, that's true shared memory. Otherwise, if there are copies
>>>>>>> going on, it's less appealing.
>>>>>> on, it's less appealing.
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks.
>>>>>>> 
>>>>>>> Sent from my iPad
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>> 
>> 
