Here are some more examples of how to use Plasma together with Arrow: http://arrow.apache.org/docs/python/plasma.html
See also the C++ documentation: http://arrow.apache.org/docs/cpp/md_tutorials_plasma.html
On Thu, Nov 16, 2017 at 10:31 AM, Philipp Moritz <pcmor...@gmail.com> wrote:

> Hey Matthias,
>
> 1. The way it is done is as in
> https://github.com/apache/arrow/blob/c6295f3b74bcc2fa9ea1b9442f922bf564669b8e/python/pyarrow/plasma.pyx#L394:
> You first create the arrow object (using the builder from C++ or the
> python functions), get its size, create a plasma object of the required
> size, use the FixedSizeBufferWriter to copy the data into shared memory
> (this is doing a multithreaded memcopy which is pretty fast, for large
> objects we measure 15GB/s), and then seal the object. All of these steps
> can be done with both the C++ and Python APIs.
>
> 2. Using mmap by hand works and if you just want to exchange some data via
> a POSIX file system interface it might be a good solution. Using Plasma has
> a number of advantages:
> a) It takes care of object lifetime management on a per object basis
> between the runtimes for you
> b) It can be used to synchronize object access between processes
> (plasma.get yields when the creator calls plasma.seal)
> c) It supports small objects of a few bytes to a few hundred bytes
> efficiently by letting them share memory mapped files
> d) If combined with the plasma manager from Ray, it allows shipping objects
> between machines easily and also has some more object synchronization via
> plasma.wait
>
> We plan to do some improvements to the C++ API and make it so
> plasma::Create returns an arrow ResizableBuffer object, then from C++ it
> will be easy to create arrow data with builders without copies and our
> Python serialization will also be able to take limited advantage of this.
>
> -- Philipp
>
> On Thu, Nov 16, 2017 at 7:30 AM, Matthias Vallentin <matth...@berkeley.edu>
> wrote:
>
>> Two questions about Plasma; my use case is sharing Arrow data between a
>> C++ and Python application (eventually also R).
>>
>> 1. What's the typical memory allocation procedure when using Plasma and
>> Arrow? Do I first construct a builder, populate it, finish it, and
>> *then* copy it into an mmapped buffer? Or do I obtain an mmapped buffer
>> from Plasma first, in which the builder operates incrementally until it's
>> full? If I understand it correctly, a Plasma buffer has a fixed size, so
>> I wonder how you accommodate the fact that the Arrow builder constructs
>> record batches incrementally, while at the same time avoiding extra
>> copying of large memory chunks after finishing the builder.
>>
>> 2. Do I need Plasma to exchange the mmapped buffers between the two
>> apps? Or could I mmap my Arrow data manually and tell pyarrow through a
>> different mechanism to obtain the shared buffer?
>>
>> Matthias