[ 
https://issues.apache.org/jira/browse/ARROW-263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427632#comment-15427632
 ] 

Philipp Moritz commented on ARROW-263:
--------------------------------------

Hey Micah,

Thanks for your insights; it seems we have been thinking along similar lines. I 
wrote one shared memory object store based on boost::IPC (which uses shm_open 
under the hood) and one based on Chrome's Mojo IPC system (which is great but 
pulls in a lot of dependencies), and I am currently writing a new, very 
lightweight one based on POSIX. I'm happy to share my findings:

Answering your in-depth analysis questions:
1. This is essentially the design of boost::IPC and this design makes it hard 
to ensure that named shared objects are removed properly (they cannot be 
removed automatically by the OS). If the process that is supposed to clean them 
up crashes, you will need to delete them manually.

2. Using mmap-based APIs is the way to go in my opinion. To create the file, 
you create a temporary file of size 0, keep the file descriptor around, unlink 
the file from the file system, and manipulate the file by mmapping it from the 
file descriptor. If you want to share the file with another process, you pass 
the file descriptor over a Unix Domain Socket. On Windows there also seem to be 
ways of doing this.

3. I also found the guarantees given by the JVM too restrictive. One approach 
that works for me is to use JNI and wrap some native C code. The alternative 
is to use JVM memory mapped files anyway and hope they do the right thing 
(which they seem to do on Linux), and to keep the file underlying the memory 
around (with similar problems as in 1).
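
As a rough illustration of the approach in point 2 (this is not the code from 
the plasma repository), here is a minimal Python sketch. It assumes Python 3.9+ 
on a POSIX system; the helper names create_anonymous_segment, send_segment, 
receive_segment and demo are made up for this example. The producer creates an 
unlinked temporary file, maps it, and hands the descriptor to a forked consumer 
over a Unix domain socket; the consumer maps the same pages and writes back.

```python
import mmap
import os
import socket
import tempfile


def create_anonymous_segment(size):
    """Create an unlinked temporary file of `size` bytes and return its fd."""
    fd, path = tempfile.mkstemp()  # the file starts out with size 0
    os.unlink(path)                # drop the name; the open fd keeps the data alive
    os.ftruncate(fd, size)         # grow the file so that it can be mapped
    return fd


def send_segment(sock, fd, size):
    """Send the descriptor (with the size as payload) over a Unix domain socket."""
    socket.send_fds(sock, [str(size).encode()], [fd])


def receive_segment(sock):
    """Receive a descriptor and map the segment it refers to."""
    msg, fds, _flags, _addr = socket.recv_fds(sock, 64, 1)
    size = int(msg.decode())
    return fds[0], mmap.mmap(fds[0], size)


def demo():
    """Share one segment between a parent (producer) and a child (consumer)."""
    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:  # child: the consumer process
        status = 1
        try:
            parent_sock.close()
            _fd, view = receive_segment(child_sock)
            if view[:5] == b"arrow":
                view[:5] = b"ARROW"  # write back through the shared mapping
                status = 0
        finally:
            os._exit(status)

    # parent: the segment is created after the fork, so the child can only
    # reach it through the descriptor received over the socket
    child_sock.close()
    size = 4096
    fd = create_anonymous_segment(size)
    view = mmap.mmap(fd, size)
    view[:5] = b"arrow"
    send_segment(parent_sock, fd, size)
    _, wait_status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(wait_status), bytes(view[:5])


if __name__ == "__main__":
    print(demo())
```

Because the file has no name once it is unlinked, the kernel frees it as soon 
as the last descriptor and mapping are gone, even if a process crashes, which 
avoids the manual cleanup problem described in point 1.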

You can find my implementation here: https://github.com/pcmoritz/plasma. It 
essentially implements what you describe in *. I'd love to get your feedback, 
and if we can share some code that would be even better. I also think that Mojo 
(Google Chrome's new IPC layer) got this "open file of size 0, unlink the 
file, keep the file descriptor around" approach working on Windows, so this 
should be OK.

The central question is then whether people are OK with using JNI for Java. I 
have some code to do this which I'm happy to share with you in the next couple 
of days.

All the best,
Philipp.

> Design an initial IPC mechanism for Arrow Vectors
> -------------------------------------------------
>
>                 Key: ARROW-263
>                 URL: https://issues.apache.org/jira/browse/ARROW-263
>             Project: Apache Arrow
>          Issue Type: New Feature
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>
> Prior discussion on this topic [1].
> Use-cases:
> 1.  User defined function (UDF) execution:  One process wants to execute a 
> user defined function written in another language (e.g. Java executing a 
> function defined in python, this involves creating Arrow Arrays in java, 
> sending them to python and receiving a new set of Arrow Arrays produced in 
> python back in the java process).
> 2.  If a storage system and a query engine are running on the same host we 
> might want to use IPC instead of RPC (e.g. Apache Drill querying Apache Kudu)
> Assumptions:
> 1.  IPC mechanism should be useable from the core set of supported languages 
> (Java, Python, C) on POSIX and ideally windows systems.  Ideally, we would 
> not need to add dependencies on additional libraries outside of each 
> languages outside of this document.
> We want to leverage shared memory for Arrays to avoid doubling RAM 
> requirements by duplicating the same Array in different memory locations.  
> 2. Under some circumstances shared memory might be more efficient than FIFOs 
> or sockets (in other scenarios it won’t be; see thread below).
> 3. Security is not a concern for V1, we assume all processes running are 
> “trusted”.
> Requirements:
> 1.Resource management: 
>     a.  Both processes need a way of allocating memory for Arrow Arrays so 
> that data can be passed from one process to another.
>     b. There must be a mechanism to cleanup unused Arrow Arrays to limit 
> resource usage but avoid race conditions when processing arrays
> 2.  Schema negotiation - before sending data, both processes need to agree on 
> schema each one will produce.
> Out of scope requirements:
> 1.  IPC channel metadata discovery is out of scope of this document.  
> Discovery can be provided by passing appropriate command line arguments, 
> configuration files or other mechanisms like RPC (in which case RPC channel 
> discovery is still an issue).
> [1] 
> http://mail-archives.apache.org/mod_mbox/arrow-dev/201603.mbox/%3c8d5f7e3237b3ed47b84cf187bb17b666148e7...@shsmsx103.ccr.corp.intel.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
