[ https://issues.apache.org/jira/browse/ARROW-263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429948#comment-15429948 ]
Philipp Moritz commented on ARROW-263:
--------------------------------------

Hey Micah,

thanks for your answer! I got the trick of unlinking the domain socket from here: https://troydhanson.github.io/network/Unix_domain_sockets.html ("Unlink before bind"). On Linux and Mac OS it seems to work and prevents leaking of the file. Note that at some point we need to introduce a named object that can be seen by all processes to bootstrap the communication between processes, and this has been the least problematic way of doing that I have seen.

At the moment I'm also working on a distributed version of the object store (with a separate process that can be used to ship objects between object stores on different nodes in a network) and investigating libuv to do it in a platform-independent way. Libuv is a small dependency and my experience so far is pretty enjoyable. It also includes limited functionality to exchange file descriptors, but this might not work on Windows (see also https://groups.google.com/forum/#!msg/libuv/0xxXBIGlzLc/H1HbL-igb84J, I haven't tried it yet).

Concerning your last comment: The plasma store is a long-running process that keeps its file descriptor and the data alive. Are page faults still a problem if data does not need to be reloaded from hard disk?

If somebody else has a platform-independent way of achieving some of these goals, I'd be happy to learn about their ideas.

> Design an initial IPC mechanism for Arrow Vectors
> -------------------------------------------------
>
>                 Key: ARROW-263
>                 URL: https://issues.apache.org/jira/browse/ARROW-263
>             Project: Apache Arrow
>          Issue Type: New Feature
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>
> Prior discussion on this topic [1].
> Use-cases:
> 1. User defined function (UDF) execution: One process wants to execute a
> user defined function written in another language (e.g. Java executing a
> function defined in Python; this involves creating Arrow Arrays in Java,
> sending them to Python and receiving a new set of Arrow Arrays produced in
> Python back in the Java process).
> 2. If a storage system and a query engine are running on the same host we
> might want to use IPC instead of RPC (e.g. Apache Drill querying Apache Kudu).
> Assumptions:
> 1. IPC mechanism should be usable from the core set of supported languages
> (Java, Python, C) on POSIX and ideally Windows systems. Ideally, we would
> not need to add dependencies on additional libraries outside of each
> languages outside of this document.
> We want to leverage shared memory for Arrays to avoid doubling RAM requirements
> by duplicating the same Array in different memory locations.
> 2. Under some circumstances shared memory might be more efficient than FIFOs
> or sockets (in other scenarios it won't; see thread below).
> 3. Security is not a concern for V1; we assume all processes running are
> “trusted”.
> Requirements:
> 1. Resource management:
>     a. Both processes need a way of allocating memory for Arrow Arrays so
> that data can be passed from one process to another.
>     b. There must be a mechanism to clean up unused Arrow Arrays to limit
> resource usage but avoid race conditions when processing arrays.
> 2. Schema negotiation - before sending data, both processes need to agree on
> the schema each one will produce.
> Out of scope requirements:
> 1. IPC channel metadata discovery is out of scope of this document.
> Discovery can be provided by passing appropriate command line arguments,
> configuration files or other mechanisms like RPC (in which case RPC channel
> discovery is still an issue).
> [1]
> http://mail-archives.apache.org/mod_mbox/arrow-dev/201603.mbox/%3c8d5f7e3237b3ed47b84cf187bb17b666148e7...@shsmsx103.ccr.corp.intel.com%3E

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
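
For readers unfamiliar with the "unlink before bind" trick mentioned in the comment, here is a minimal Python sketch of the idea. It is not the plasma store's actual code. The function names and the socket path are hypothetical and chosen only for illustration. The point it shows is that a Unix domain socket file stays on disk after the socket is closed, so a later bind() to the same path fails unless the stale file is removed first.

```python
import os
import socket
import tempfile


def bind_unix_socket(path: str) -> socket.socket:
    """Bind a listening Unix domain socket at `path` (hypothetical helper).

    Any stale socket file left over from a previous run is unlinked first;
    otherwise bind() would fail with "Address already in use".
    """
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass  # no stale file to clean up
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    return server


def demo() -> bytes:
    """Bind twice at the same path, then pass one message through the socket."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "store.sock")  # hypothetical socket name

    # First "run": closing the socket does not remove the file on disk.
    first = bind_unix_socket(path)
    first.close()
    assert os.path.exists(path)

    # Second "run": succeeds only because the stale file is unlinked first.
    server = bind_unix_socket(path)
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(path)
    conn, _ = server.accept()
    client.sendall(b"ping")
    data = conn.recv(4)

    # Clean up sockets and the temporary directory.
    for s in (conn, client, server):
        s.close()
    os.unlink(path)
    os.rmdir(tmpdir)
    return data
```

Calling demo() binds the same path twice in a row and then sends a short message from a client to the server. This sketch only covers POSIX systems; it does not address the Windows portability question raised in the comment.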