This is all very interesting stuff, but just so we're clear: it is not Arrow's responsibility to provide an RPC/IPC/LPC mechanism, nor facilities for resource management. If we DID decide to make this Arrow's responsibility, it would overlap with other components that specialize in exactly this kind of work.
> On Mar 16, 2016, at 9:49 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>
> @Todd: agree entirely on prototyping design. My goal is to throw out some ideas and some POC code and then we can explore from there.
>
> My main thoughts have initially been around lifecycle management. I've done some work previously where a consistently sized shared buffer using mmap has improved performance. This is more complicated given the requirements for providing collaborative allocation and cross-process reference counts.
>
> With regards to whether this is more generally applicable: I think it could ultimately be more general, but I suggest we focus initially on the particular application of moving long-lived Arrow record batches between a producer and a consumer. Constraining the problem means we will get to something workable sooner. We can abstract to a more general solution as other clear requirements emerge.
>
> With regards to capnproto, I believe that when they talk about zero-copy shared memory they are simply saying the structure supports it (same as any memory-layout based design). I don't believe they actually implemented a protocol and multi-language implementation for zero-copy cross-process communication.
>
> One other note to make here is that my goal is not just about performance but also about memory footprint: being able to have a shared memory protocol that allows multiple tools to interact with the same hot dataset.
>
> RE: ACL, for the initial focus, I suggest that we consider the two sharing processes to be "trusted" and expect the initial Arrow API reference implementations to manage memory access.
>
> Regarding other questions that Todd threw out:
>
> - if you are using an mmapped file in /dev/shm/, how do you make sure it gets cleaned up if the process crashes?
>
> >>> Agreed that it needs to get resolved. If I recall, destruction can be applied once the associated processes are attached to memory, and this allows the kernel to recover once all attaching processes are destroyed. If this isn't enough, then we may very well need a simple external coordinator.
>
> - how do you allocate memory to it? there's nothing ensuring that /dev/shm doesn't swap out if you try to put too much in there, and then your in-memory super-fast access will basically collapse under swap thrashing
>
> >>> The simplest model initially is probably one where we assume a master and a slave (ideally negotiated on initial connection). The master is responsible for allocating memory and giving that to the slave. The master then is responsible for managing reasonable memory allocation limits just like any other. Slaves that need to allocate memory must ask the master (at whatever chunk size makes sense) and will get rejected if they are too aggressive. (This probably means that at any point an IPC can fall back to RPC??)
>
> - how do you do lifecycle management across the two processes? If, say, Kudu wants to pass a block of data to some Python program, how does it know when the Python program is done reading it and it should be deleted? What if the Python program crashed in the middle - when can Kudu release it?
>
> >>> My thinking, as mentioned earlier, is a shared reference count model for complex situations. Possibly a "request/response" ownership model for simpler cases.
>
> - how do you do security? If both sides of the connection don't trust each other, and use length prefixes and offsets, you have to be constantly validating and re-validating everything you read.
>
> I'm suggesting that we start with trust so we don't get too wrapped up in all the extra complexities of security. My experience with these things is that a lot of users will frequently pick performance or footprint over security for quite some time. For example, if I recall correctly, with the shared file descriptor model that was initially implemented in the HDFS client, people used short-circuit reads for years before security was correctly implemented. (Am I remembering this right?)
>
> Lastly, as I mentioned above, I don't think there should be any requirement that Arrow communication be limited to only 'IPC'. As Todd points out, in many cases unix domain sockets will be just fine.
>
> We need to implement both models because we all know that locality will never be guaranteed. The IPC design/implementation needs to be good for anything to make it into Arrow.
>
> thanks
> Jacques
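(For reference: the unlink-after-attach cleanup Jacques describes maps naturally onto plain Java. The sketch below is illustrative only; it assumes a Linux tmpfs mount at /dev/shm, and the names are made up. The creator deletes the directory entry once its peers have mapped the file, so the kernel reclaims the pages when the last mapping goes away, even if a process crashes.)

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ShmSegment {

  // Create a tmpfs-backed segment and map it.
  public static MappedByteBuffer create(String name, long size) throws Exception {
    File file = new File("/dev/shm", name);                 // assumption: Linux tmpfs
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
      raf.setLength(size);
      // The mapping stays valid after the channel is closed.
      return raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, size);
    }
  }

  // Called by the creator after all consumers report that they have mapped the
  // file. From this point on no explicit cleanup is needed, even on a crash:
  // the kernel frees the pages when the last mapping disappears.
  public static void unlinkAfterAttach(String name) {
    new File("/dev/shm", name).delete();
  }
}

(A consumer simply opens and maps the same path before the creator unlinks it; after the unlink the name is gone, but the pages live on until every process has unmapped or exited.)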
> On Wed, Mar 16, 2016 at 8:54 AM, Zhe Zhang <z...@apache.org> wrote:
>
>> I have similar concerns as Todd stated below. With an mmap-based approach, we are treating shared memory objects like files. This brings in all the filesystem-related considerations like ACLs and lifecycle management.
>>
>> Stepping back a little, the shared-memory work isn't really specific to Arrow. A few questions related to this:
>> 1) Has the topic been discussed in the context of protobuf (or other IPC protocols) before? It seems Cap'n Proto (https://capnproto.org/) has zero-copy shared memory. I haven't read the implementation details though.
>> 2) If the shared-memory work benefits a wide range of protocols, should it be a generalized and standalone library?
>>
>> Thanks,
>> Zhe
>>
>> On Tue, Mar 15, 2016 at 8:30 PM Todd Lipcon <t...@cloudera.com> wrote:
>>
>>> Having thought about this quite a bit in the past, I think the mechanics of how to share memory are by far the easiest part. The much harder part is the resource management and ownership. Questions like:
>>>
>>> - if you are using an mmapped file in /dev/shm/, how do you make sure it gets cleaned up if the process crashes?
>>> - how do you allocate memory to it? there's nothing ensuring that /dev/shm doesn't swap out if you try to put too much in there, and then your in-memory super-fast access will basically collapse under swap thrashing
>>> - how do you do lifecycle management across the two processes? If, say, Kudu wants to pass a block of data to some Python program, how does it know when the Python program is done reading it and it should be deleted? What if the Python program crashed in the middle - when can Kudu release it?
>>> - how do you do security? If both sides of the connection don't trust each other, and use length prefixes and offsets, you have to be constantly validating and re-validating everything you read.
>>>
>>> Another big factor is that shared memory is not, in my experience, immediately faster than just copying data over a unix domain socket. In particular, the first time you read an mmapped file, you'll end up paying minor page fault overhead on every page. This can be improved with HugePages, but huge page mmaps are not supported yet in current Linux (work going on currently to address this). So you're left with hugetlbfs, which involves static allocations and much more pain.
>>>
>>> All the above is a long way to say: let's make sure we do the right prototyping and up-front design before jumping into code.
>>>
>>> -Todd
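(On Todd's point about minor page faults: one cheap mitigation on the Java side is to pre-touch the mapping before handing it to the consumer. The sketch below is illustrative only; the class and method names are made up, and it assumes the segment is small enough that paying the fault cost up front is acceptable.)

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class PreFault {

  // Map an existing segment read-only and touch every page now, so the minor
  // faults are paid once up front rather than during the first scan of the data.
  public static MappedByteBuffer mapAndTouch(String path) throws Exception {
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
      long size = raf.length();
      MappedByteBuffer buf = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, size);
      buf.load();   // best effort: asks the OS to bring the pages into physical memory
      return buf;
    }
  }
}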
>>> On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>>>
>>>> @Corey
>>>> The POC Steven and Wes are working on is based on MappedBuffer, but I'm looking at using Netty's fork of tcnative to use shared memory directly.
>>>>
>>>> @Yiannis
>>>> We need to have both RPC and a shared memory mechanism (what I'm inclined to call IPC, though it is really a specific kind of IPC). The idea is that we negotiate via RPC and then, if we determine shared locality, we work over shared memory (preferably for both data and control). So the system interacting with HBase in your example would be the one responsible for placing collocated execution to take advantage of IPC.
>>>>
>>>> How do others feel about my redefinition of IPC to mean same-memory-space communication (either via shared memory or RDMA), versus RPC as socket-based communication?
>>>>
>>>> On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>>>
>>>>> I was seeing Netty's unsafe classes being used here, not mapped byte buffers. Not sure if that statement is completely correct, but I'll have to dig through the code again to figure that out.
>>>>>
>>>>> The more I was looking at Unsafe, it makes sense why that would be used. Apparently it's also supposed to be included in Java 9 as a first-class API.
>>>>>
>>>>> On Mar 15, 2016 7:03 PM, "Wes McKinney" <w...@cloudera.com> wrote:
>>>>>
>>>>>> My understanding is that you can use java.nio.MappedByteBuffer to work with memory-mapped files as one way to share memory pages between Java (and non-Java) processes without copying.
>>>>>>
>>>>>> I am hoping that we can reach a POC of zero-copy Arrow memory sharing Java-to-Java and Java-to-C++ in the near future. Indeed this will have huge implications once we get it working end to end (for example, receiving memory from a Java process in Python without a heavy ser-de step -- it's what we've always dreamed of) and with the metadata and shared memory control flow standardized.
>>>>>>
>>>>>> - Wes
>>>>>>
>>>>>> On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cjno...@gmail.com> wrote:
>>>>>>
>>>>>>> If I understand correctly, Arrow is using Netty underneath, which is using Sun's Unsafe API in order to allocate direct byte buffers off heap. It is using Netty to communicate, between "client" and "server", information about memory addresses for data that is being requested.
>>>>>>>
>>>>>>> I've never attempted to use the Unsafe API to access off-heap memory that has been allocated in one JVM from another JVM, but I'm assuming this must be the case in order to claim that the memory is being accessed directly without being copied, correct?
>>>>>>>
>>>>>>> The implication here is huge. If the memory is being directly shared across processes by them being allowed to directly reach into the direct byte buffers, that's true shared memory. Otherwise, if there are copies going on, it's less appealing.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Sent from my iPad
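(To Corey's question: a JVM cannot reach into another JVM's private heap, but if both processes map the same file then Unsafe can operate on that shared region directly, which is also one way to hold the cross-process reference count Jacques mentions. The sketch below is illustrative only; the class name is made up, it leans on internal APIs (sun.misc.Unsafe, sun.nio.ch.DirectBuffer), and it assumes aligned lock-free integer atomics behave the same across the participating processes.)

import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import sun.misc.Unsafe;

// Keeps a reference count in the first four bytes of a memory-mapped segment
// that several processes have mapped. Illustrative sketch only.
public class SharedRefCount {

  private static final Unsafe UNSAFE = loadUnsafe();
  private final long address;   // absolute address of the mapped segment

  public SharedRefCount(MappedByteBuffer segment) {
    // MappedByteBuffer is backed by a DirectBuffer; address() exposes the raw
    // pointer that Unsafe can use for absolute-address atomics.
    this.address = ((sun.nio.ch.DirectBuffer) segment).address();
  }

  public int retain() {
    return UNSAFE.getAndAddInt(null, address, 1) + 1;
  }

  public int release() {
    return UNSAFE.getAndAddInt(null, address, -1) - 1;   // 0 means safe to unmap
  }

  private static Unsafe loadUnsafe() {
    try {
      Field f = Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException(e);
    }
  }
}

(Note that a crashed process never decrements its count, so a refcount alone does not cover the crash case; that is where the simple external coordinator Jacques mentions would come in.)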
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
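(Finally, for concreteness: the allocation handshake Jacques describes, where a master owns the budget and a slave asks for chunks and falls back to plain RPC when refused, might be sketched as an interface along these lines. Every name here is hypothetical; nothing like this exists in Arrow today.)

import java.nio.ByteBuffer;
import java.util.Optional;

// Hypothetical shape of the master/slave allocation protocol discussed above.
// The master owns the shared-memory budget; a slave requests chunks and falls
// back to ordinary RPC transfer whenever a request is refused.
public interface SharedAllocator {

  // Ask the master for a chunk of shared memory. An empty result means the
  // request was over budget and the caller should fall back to RPC.
  Optional<ByteBuffer> requestChunk(long bytes);

  // Return a previously granted chunk so the master can reuse or unmap it.
  void releaseChunk(ByteBuffer chunk);
}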