Re: General questions about Arrow & Plasma

Philipp Moritz Thu, 16 Nov 2017 10:38:28 -0800

Here are some more examples of how to use Plasma together with Arrow:
http://arrow.apache.org/docs/python/plasma.html. See also the C++
documentation: http://arrow.apache.org/docs/cpp/md_tutorials_plasma.html


On Thu, Nov 16, 2017 at 10:31 AM, Philipp Moritz <pcmor...@gmail.com> wrote:

> Hey Matthias,
>
> 1. The way it is done is as in
> https://github.com/apache/arrow/blob/c6295f3b74bcc2fa9ea1b9442f922bf564669b8e/python/pyarrow/plasma.pyx#L394:
> You first create the Arrow object (using the builder from C++ or the
> Python functions), get its size, create a Plasma object of the required
> size, use the FixedSizeBufferWriter to copy the data into shared memory
> (this does a multithreaded memcpy, which is pretty fast; for large objects
> we measure 15 GB/s), and then seal the object. All of these steps can be
> done with both the C++ and Python APIs.
>
> 2. Using mmap by hand works and if you just want to exchange some data via
> a POSIX file system interface it might be a good solution. Using Plasma has
> a number of advantages:
> a) It takes care of object lifetime management on a per object basis
> between the runtimes for you
> b) It can be used to synchronize object access between processes
> (plasma.get returns once the creator has called plasma.seal)
> c) It supports small objects of a few bytes to a few hundred bytes
> efficiently by letting them share memory mapped files
> d) If combined with the plasma manager from Ray, it makes it easy to ship
> objects between machines and also offers some more object synchronization
> via plasma.wait
>
> We plan to make some improvements to the C++ API so that plasma::Create
> returns an Arrow ResizableBuffer object. From C++ it will then be easy to
> create Arrow data with builders without copies, and our Python
> serialization will also be able to take limited advantage of this.
>
> -- Philipp
>
> On Thu, Nov 16, 2017 at 7:30 AM, Matthias Vallentin <matth...@berkeley.edu
> > wrote:
>
>> Two questions about Plasma; my use case is sharing Arrow data between a
>> C++ and Python application (eventually also R).
>> 1. What's the typical memory allocation procedure when using Plasma and
>>  Arrow? Do I first construct a builder, populate it, finish it, and
>>  *then* copy it into an mmapped buffer? Or do I obtain an mmapped buffer
>>  from Plasma first, in which the builder operates incrementally until
>>  it's full? If I understand it correctly, a Plasma buffer has a fixed
>>  size, so I wonder how you accommodate the fact that the Arrow builder
>>  constructs record batches incrementally, while at the same time
>>  avoiding extra copying of large memory chunks after finishing the
>>  builder.
>>
>> 2. Do I need Plasma to exchange the mmapped buffers between the two
>>  apps? Or could I mmap my Arrow data manually and tell pyarrow through
>>  a different mechanism to obtain the shared buffer?
>>    Matthias
>>
>
>
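
For the "mmap by hand" route from point 2 above, a minimal sketch using only the Python standard library might look like the following. The file name, the payload, and the helper names write_shared and read_shared are hypothetical; in practice the payload would be serialized Arrow data, and the producer and consumer would be separate processes rather than two calls in one script.

```python
import mmap
import os
import tempfile


def write_shared(path: str, payload: bytes) -> None:
    """Producer: size the file, map it, and copy the payload in."""
    with open(path, "wb+") as f:
        f.truncate(len(payload))  # the mapping has a fixed size
        with mmap.mmap(f.fileno(), len(payload)) as mm:
            mm[:] = payload
            mm.flush()


def read_shared(path: str) -> bytes:
    """Consumer: map the same file read-only and read the bytes."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[:]  # copies out; a real consumer could wrap mm instead


# Hypothetical location and placeholder payload.
path = os.path.join(tempfile.mkdtemp(), "shared.bin")
payload = b"serialized Arrow data would go here"
write_shared(path, payload)
result = read_shared(path)
```

Note what this leaves to the caller: nothing tells the consumer when the producer has finished writing, and nothing cleans the file up afterwards. Those are the lifetime and synchronization points listed as Plasma's advantages in a) and b).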
