Hi Chris,

To add to Uwe's e-mail:
> In this case, the sharing is zero-serialization but not zero-copy.

That depends. If an implementation supports shared memory, then
zero-copy access is possible. So if you generated data in C++, you
could access it in another C++ program or a Python program without
copying it into memory. If you had a 50GB dataset in Arrow format on
disk, you could access any column, row, or single value without any
deserialization or copying, provided you are using the C++ libraries
(or any bindings thereof, like C, Python, Ruby, etc.).

Not all implementations support accessing Arrow via shared memory yet.
For example, Java does not yet.

- Wes

On Sun, Apr 15, 2018 at 5:05 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
> Hello Chris,
>
> At the moment, we have focused on sharing Arrow structures via inter-process
> communication (IPC). In this case, the sharing is zero-serialization but not
> zero-copy. Given that we now have good integration tests for a good subset of
> all implementations, sharing memory between different implementations
> with no copy of the data is the next step.
>
> As each Arrow implementation has its own user-facing data structures
> on top of the same backing memory layout, we will have to write some APIs that
> can convert one interface to another. A very simple example that takes the Java
> Arrow structures and makes them available to Python is included in this PR
> (comment): https://github.com/apache/arrow/pull/1693
>
> Note that this is not needed for all languages. For example, the Python, Ruby
> and GLib implementations are all backed by the C++ implementation. Here you can
> simply extract the backing C++ object and use it in the other language. Thus a
> pyarrow.Array created in Python already contains a C++ arrow::Array object
> which could then be used directly as a backing object for Ruby.
>
> Uwe
>
> On Thu, Apr 12, 2018, at 9:22 AM, Chris Withers wrote:
>> Hi All,
>>
>> Apologies if I'm on the wrong list or struggle to get my question
>> across, I'm very new to Arrow, so please point me to the best place if
>> there's somewhere better to ask these kinds of questions...
>>
>> So, in my mind, Arrow provides a single in-memory model that supports
>> access from a bunch of different languages/environments (Pandas, Go,
>> C++, etc. from looking at https://github.com/apache/arrow), which gives
>> me hope that, as someone just starting out on a project to go from a
>> proprietary C++ trading framework's market data archive to Pandas
>> dataframes, Arrow would be a good place to look and, if things go through
>> Arrow in the middle, potentially a way for other environments (Go, Julia?)
>> to make use of the same thing.
>>
>> That left me wondering, however: if I write a "to arrow" thing in
>> C++, how would a Go or Python user then wire things up to get access to
>> the Arrow data structures?
>> Somewhat important bonus point: how would that happen without memory
>> copies? (Datasets here are many GB in most cases.)
>>
>> cheers,
>>
>> Chris
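
As a rough illustration of the memory-mapped access Wes describes above (pulling a few values out of a large on-disk column without loading or deserializing the file), here is a minimal Python sketch using only the standard library. It does not use Arrow or pyarrow: the file is just a bare run of native-endian int64 values, and write_column, read_values and the file name are hypothetical names invented for this example.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Write values as one contiguous buffer of native-endian int64."""
    with open(path, "wb") as f:
        array("q", values).tofile(f)


def read_values(path, start, stop):
    """Return values[start:stop] from the memory-mapped file.

    The file is mapped rather than read: the OS only pages in the part
    that backs the requested slice, and there is no parsing step because
    the bytes on disk already have the in-memory layout.
    """
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # memoryview exposes the mapping without copying it; cast("q")
        # reinterprets the same bytes as int64 in place.
        with memoryview(mapped) as raw, raw.cast("q") as column:
            # Only the requested slice is turned into Python objects.
            return column[start:stop].tolist()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "prices.bin")  # hypothetical file name
        write_column(path, range(1_000_000))
        print(read_values(path, 500_000, 500_003))
```

This only shows the underlying mechanism. With the Arrow C++ libraries or their bindings, the same idea applies to whole columns of a file in Arrow format, and a second process can map the same file without either one copying the data.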