Hi Chris,

To add to Uwe's e-mail:

> In this case, the sharing is zero-serialization but not zero-copy.

This depends. If an implementation supports shared memory, then
zero-copy access is possible. If you generated data in C++, you
could access it in another C++ program or Python program without
copying it into memory. For example, if you had a 50GB dataset in
Arrow format on disk, you could access any column, row, or single
value without any deserialization or copying, as long as you are using
the C++ libraries (or any bindings thereof, like C, Python, Ruby, etc.).

Not all implementations support accessing Arrow via shared memory yet.
For example, Java does not support it yet.

- Wes

On Sun, Apr 15, 2018 at 5:05 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
> Hello Chris,
>
> At the moment, we have focused on sharing Arrow structures via inter-process
> communication (IPC). In this case, the sharing is zero-serialization but not
> zero-copy. Given that we now have good integration tests for a good subset of
> all implementations, sharing memory between different implementations
> with no copy of the data is the next step.
>
> As each Arrow implementation has its own user-facing data structures
> with the same backing memory layout, we will have to write some APIs that can
> convert one interface to another. A very simple example that takes the Java
> Arrow structures and makes them available to Python is included in this PR
> (comment): https://github.com/apache/arrow/pull/1693
>
> Note that this is not needed for all languages. For example, the Python, Ruby
> and GLib implementations are all backed by the C++ implementation. Here you can
> simply extract that backing C++ object and use it in the other language. Thus a
> pyarrow.Array created in Python already contains a C++ arrow::Array object
> which could then be used directly as a backing object for Ruby.
>
> Uwe
>
> On Thu, Apr 12, 2018, at 9:22 AM, Chris Withers wrote:
>> Hi All,
>>
>> Apologies if I'm on the wrong list or struggle to get my question
>> across. I'm very new to Arrow, so please point me to the best place if
>> there's somewhere better to ask these kinds of questions...
>>
>> So, in my mind, Arrow provides a single in-memory model that supports
>> access from a bunch of different languages/environments (Pandas, Go,
>> C++, etc. from looking at https://github.com/apache/arrow). That gives
>> me hope that, as someone just starting out on a project to go from a
>> proprietary C++ trading framework's market data archive to Pandas
>> dataframes, Arrow would be a good way to go and, if things go through
>> Arrow in the middle, potentially a way for other environments (Go,
>> Julia?) to make use of the same thing.
>>
>> That left me wondering, however: if I write a "to arrow" thing in
>> C++, how would a Go or Python user then wire things up to get access to
>> the Arrow data structures?
>> Somewhat important bonus point: how would that happen without memory
>> copies? (Datasets here are many GB in most cases.)
>>
>> cheers,
>>
>> Chris
