Thanks Micah for the answers. It looks like a good plan: the IPC concerns to be documented separately, and the schema to be added to the spec, I guess.

Regarding use cases between Java and C++, I think there may be an important one: how, in a framework, the Java layer accesses data or objects in the native (C++) layer. It was discussed quite some time ago, in the context of how JNI may be better than pure Java. It sounds like native C++ leaves much more room for SIMD, so the Java layer would just instruct the native C++ layer to load/access data in the format somewhere, perform the desired computation, and then retrieve the computed results to respond to end users. I'm particularly interested in this case and wonder whether there are any plans or thoughts about it. Java to C++ may be very similar to Python to C++, though I haven't looked into the Python part yet.

I had a quick look at the Java part. It looks like there is little attempt to sync or unify Java and C++ at the API level, though they share the same binary representation. This would complicate implementing the use case I mentioned above, and cause confusion for developers when they switch from one (like C++) to the other (say Java). At least the high-level constructs should have the same names, ideally conforming to the spec. The Java parts look like they inherit their style from Apache Drill, I guess.

Just some quick thoughts by the way; it might be better to discuss them separately.

Regards,
Kai

-----Original Message-----
From: Micah Kornfield [mailto:emkornfi...@gmail.com]
Sent: Sunday, April 10, 2016 12:23 PM
To: dev@arrow.apache.org
Subject: Re: More questions about layout spec

Hi Kai,

Based on a previous thread on the mailing list (http://mail-archives.apache.org/mod_mbox/arrow-dev/201603.mbox/%3c72a2bcfd-54d7-4376-8199-04a5535d8...@gmail.com%3E) I believe the conclusion was that there should be an optional reference implementation for IPC, so consumers of the memory format aren't necessarily required to tie themselves to a particular technology or component that they don't want to consume (e.g.
some users of arrow might just want the C++ objects and our not-yet-existent algorithms component).

I think creating new document(s) to detail IPC concerns makes sense instead of updating the existing document. As you noted, Wes added the beginnings of an implementation for C++ that uses memory mapped files and https://github.com/apache/arrow/blob/master/format/Message.fbs to describe the schema. I think the decision might have been made to hold off writing a concrete spec until we could verify that a simple use case worked between Java and C++. One of the committers might have a better view on this. It probably pays to start writing up a document based on the current implementation anyway, so people have broader visibility into future plans (and can provide feedback without reading the C++ code).

Another mode of transport that deserves a reference specification/implementation is transferring tables via a socket (there is already a JIRA open to create one via Unix domain sockets, but this should likely be generalized to just sockets).

I think we should open JIRAs to track writing reference specs for both shared memory and socket based transport.

Thanks,
-Micah

On Sat, Apr 9, 2016 at 7:55 PM, Zheng, Kai <kai.zh...@intel.com> wrote:
> Hi,
>
> Looking at the layout spec, I have some more questions to complement the
> previous ones discussed.
>
> About struct and union types:
>
> 1) The order of the fields (how they are declared in order) seems to be
> important, as it will affect how the data are laid out. For example, in union
> type, how to organize and interpret field types, offsets and data arrays.
> Similar to struct type.
>
> 2) Nothing is said about how their schema is represented and
> how/where the schema is attached. Should the layout also contain the schema
> info? In the cpp code, Table is implemented as columns with self-contained
> schema info.
>
> About schema:
> Nothing is mentioned about schema in the spec, not sure if it should be a
> natural part of it. Without self-contained schema info, it won't be possible to
> interpret and process the layout data (like List, Struct, Table, Union,
> etc.) across machines and languages.
>
> About non-goals:
> The following are listed as non-goals for the document but in fact they're going
> to be implemented. Should we remove them?
>
> 1. To specify standardized metadata or a data layout for RPC or
> transient file storage.
>
> 2. Any "table" structure composed of named arrays each having their own
> type or any other structure that composes arrays.
>
> Regards,
> Kai
>
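
On point 1) of the quoted questions (field order in a union), here is a minimal Python sketch of why the declared order seems to matter. The buffer names `types` and `offsets` follow the wording of the question; `read_union` and the plain-list layout are hypothetical simplifications for illustration only, not the Arrow API or the exact encoding in the spec.

```python
# Hypothetical, simplified union layout (NOT the Arrow API):
#   types[i]   -> which child array holds slot i, as an index into the
#                 declared field order
#   offsets[i] -> position of slot i's value inside that child array


def read_union(types, offsets, children):
    """Decode every slot by picking its child array, then its offset."""
    return [children[t][o] for t, o in zip(types, offsets)]


# Two child arrays, one per declared field.
ints = [10, 20]
strs = ["a", "b"]

# Buffers written by a producer that declared the fields as (int, str).
types = [0, 1, 1, 0]
offsets = [0, 0, 1, 1]

# Reader that agrees with the producer on the field order.
declared = read_union(types, offsets, [ints, strs])

# Reader that assumes the opposite order (str, int) over the same buffers.
swapped = read_union(types, offsets, [strs, ints])

print(declared)
print(swapped)
```

The buffers are identical in both reads; only the assumed field order differs, and the decoded values differ with it. That is the sense in which a reader in another language or on another machine would need the schema (or at least the field order) attached somewhere.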