On Tue, Oct 1, 2019 at 3:22 PM Jed Brown <[email protected]> wrote: > > I'd just like to chime in with the use case of in-situ data analysis for > simulations. This domain tends to be cautious with dependencies and > there is a lot of C and Fortran, but the in-situ analysis tools will > preferably reside in separate processes while sharing memory via shared > memory (/dev/shm or MPI_Win_allocate_shared). An in-memory protocol > that holds raw pointers would be problematic because they are typically > in different virtual address spaces when shared between processes. I > think this is a potential application for a C interface with lean > dependencies, but it wouldn't be useful if it can't be shared > out-of-process. >
hi Jed -- I will respond to Jacques's e-mail when I have some time to compose, but we're looking at a pretty different use case which is C libraries (or libraries with C FFI) exposing in-memory data structures to each other in the same process using the same virtual address space. - Wes > Jacques Nadeau <[email protected]> writes: > > > I disagree with this statement: > > > > - the IPC format is meant for serialization while the C data protocol is > > meants for in-memory communication, so different concerns apply > > > > If that is how the a particular implementation presents it, that is a > > weaknesses of the implementation, not the format. The primary use case I > > was focused on when working on the initial format was communication within > > the same process. It seems like this is being used as a basis for the > > introduction of new things when the premise is inconsistent with the > > intention of the creation. The specific reason we used flatbuffers in the > > project was to collapse the separation of in-process and out-of-process > > communication. It means the same thing it does with the Arrow data itself: > > that a consumer doesn't have to use a particular library to interact with > > and use the data. > > > > It seems like there are two ideas here: > > > > 1) How do we make it easier for people to use Arrow? > > 2) Should we implement a new in memory representation of Arrow that is > > language specific. > > > > I'm entirely in support of number one. If for a particular type of domain, > > people want an easier way to interact with Arrow, let's make a new library > > that helps with that. In easy of our current libraries, we do many things > > to make it easier to work with Arrow. None of those require a change to the > > core format or are formalized as a new in-memory standard. The in-memory > > representation of rust or javascript or java objects are implementation > > details. > > > > I'm against number two as it creates a fragmentation problem. Arrow is > > about having a single canonical format for memory for both metadata and > > data. Having multiple in-memory formats (especially when some are not > > language independent) is counter to the goals of the project. > > > > Two other, separate comments: > > 1) I don't understand the idea that we need to change the way Arrow > > fundamentally works so that people can avoid using a dependency. If the > > dependency is small, open source and easy to build, people can fork it and > > include directly if they want to. Let's not violate project principles > > because DuckDB has a religious perspective on dependencies. If the problem > > is people have to swallow too large of a pill to do basic things with Arrow > > in C, let's focus on fixing that (to our definition of ease, not someone > > else's). If FlatCC solves some those things, great. If we need to build a > > baby integration library that is more C centric, great. Neither of those > > things require implementing something at the format level. > > > > 2) It seems like we should discuss the data structure problem separately > > from the reference management concern. > > > > > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <[email protected]> wrote: > > > >> hi Antoine, > >> > >> On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <[email protected]> wrote: > >> > > >> > > >> > Le 01/10/2019 à 00:39, Wes McKinney a écrit : > >> > > A couple things: > >> > > > >> > > * I think a C protocol / FFI for Arrow array/vectors would be better > >> > > to have the same "shape" as an assembled array. Note that the C > >> > > structs here have very nearly the same "shape" as the data structure > >> > > representing a C++ Array object [1]. The disassembly and reassembly > >> > > here is substantially simpler than the IPC protocol. A recursive > >> > > structure in Flatbuffers would make RecordBatch messages much larger, > >> > > so the flattened / disassembled representation we use for serialized > >> > > record batches is the correct one > >> > > >> > I'm not sure I agree: > >> > > >> > - indeed, it's not a coincidence that the ArrowArray struct looks quite > >> > closely like the C++ ArrayData object :-) We have good experience with > >> > that abstraction and it has proven to work quite well > >> > > >> > - the IPC format is meant for serialization while the C data protocol is > >> > meants for in-memory communication, so different concerns apply > >> > > >> > - the fact that this makes the layout slightly larger doesn't seem > >> > important at all; we're not talking about transferring data over the wire > >> > > >> > There's also another argument for having a recursive struct: it > >> > simplifies how the data type is represented, since we can encode each > >> > child type individually instead of encoding it in the parent's format > >> > string (same applies for metadata and individual flags). > >> > > >> > >> I was saying something different here. I was making an argument about > >> why we use the flattened array-of-structs in the IPC protocol. One > >> reason is that it's a more compact representation. That is not very > >> important here because this protocol is only for *in-process* (for > >> languages that have a C FFI facility) rather than *inter-process* > >> communication. > >> > >> I agree also that the type encoding is simple, here, too, since we > >> aren't having to split the schema and record batch between different > >> serialized messages. There is some potential waste with having to > >> populate the type fields multiple times when communicating a sequence > >> of "chunks" from the same logical dataset. > >> > >> > > * The "formal" C protocol having the "assembled" shape means that many > >> > > minimal Arrow users won't have to implement any separate data > >> > > structures. They can just use the C struct directly or a slightly > >> > > wrapped version thereof with some convenience functions. > >> > > >> > Yes, but the same applies to the current proposal. > >> > > >> > > * I think that requiring building a Flatbuffer for minimal use cases > >> > > (e.g. communicating simple record batches with primitive types) passes > >> > > on implementation burden to minimal users. > >> > > >> > It certainly does. > >> > > >> > > I think the mantra of the C protocol should be the following: > >> > > > >> > > * Users of the protocol have to write little to no code to use it. For > >> > > example, populating an INT32 array should require only a few lines of > >> > > code > >> > > >> > Agreed. As a sidenote, the spec should have an example of doing this in > >> > raw C. > >> > > >> > Regards > >> > > >> > Antoine. > >>
