RE: More questions about layout spec

Zheng, Kai
Sun, 10 Apr 2016 06:48:27 -0700

Thanks, Micah, for the answers. It looks like a good plan: the IPC details will
be documented separately, and the schema will be added to the spec, I guess.

Regarding use cases between Java and C++, I think there may be an important
one: how, in a framework, the Java layer accesses data or objects in the native
(C++) layer. This was discussed quite some time ago, in terms of how JNI may be
better than pure Java. It sounds like native C++ leaves much more room for
SIMD, so the Java layer would just instruct the native C++ layer to load or
access data in the format somewhere, perform the desired computation, and then
retrieve the computed results to respond to end users. I'm particularly
interested in this case and wonder if there are any plans or thoughts about it.
Java to C++ may be very similar to Python to C++, though I haven't looked into
the Python part yet.

I took a quick look at the Java part. It looks like there is little attempt to
sync or unify the Java and C++ APIs, even though they share the same binary
representation. This would complicate implementing the use case I mentioned
above, and would confuse developers when they switch from one (like C++) to the
other (say Java). At the least, the high-level constructs should have the same
names, preferably conforming to the spec. The Java parts look like they inherit
their style from Apache Drill, I guess.

Just some quick thoughts, by the way; it might be better to discuss them
separately.

Regards,
Kai

-----Original Message-----
From: Micah Kornfield [mailto:emkornfi...@gmail.com] 
Sent: Sunday, April 10, 2016 12:23 PM
To: dev@arrow.apache.org
Subject: Re: More questions about layout spec

Hi Kai,
Based on a previous thread on the mailing list
(http://mail-archives.apache.org/mod_mbox/arrow-dev/201603.mbox/%3c72a2bcfd-54d7-4376-8199-04a5535d8...@gmail.com%3E)
I believe the conclusion was that there should be an optional reference
implementation for IPC, so consumers of the memory format aren't necessarily
required to tie themselves to a particular technology or component that they
don't want to consume (e.g. some users of Arrow might just want the C++ objects
and our not-yet-existent algorithms component). I think creating new
document(s) to detail IPC concerns makes sense, instead of updating the existing
document.

As you noted, Wes added the beginnings of an implementation for C++ that uses
memory mapped files and
https://github.com/apache/arrow/blob/master/format/Message.fbs to
describe the schema. I think the decision might have been made to hold off
writing a concrete spec until we could verify that a simple use case worked
between Java and C++. One of the committers might have a better view on this.
It probably pays to start writing up a document based on the current
implementation anyway, so people have broader visibility into future plans (and
can provide feedback without reading the C++ code).
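
To make the memory-mapped approach a bit more concrete, here is a minimal
sketch in Python. It assumes a toy file layout (an int32 length, a validity
bitmap with one bit per slot, then the int32 values); this layout and the names
write_int32_array and read_int32_array are hypothetical and are not the
Message.fbs schema or the current C++ implementation. A second process could
map the same file and decode the column the same way.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical toy layout (NOT the Arrow IPC format or Message.fbs):
#   int32 length | validity bitmap (1 bit per slot, LSB first) | int32 values
HEADER = struct.Struct("<i")


def write_int32_array(path, values):
    """Write a nullable int32 column; None marks a null slot."""
    n = len(values)
    bitmap = bytearray((n + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # set the validity bit for slot i
    data = struct.pack("<%di" % n, *[0 if v is None else v for v in values])
    with open(path, "wb") as f:
        f.write(HEADER.pack(n))
        f.write(bitmap)
        f.write(data)


def read_int32_array(path):
    """Map the file and decode the column directly from the mapping."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            (n,) = HEADER.unpack_from(mm, 0)
            bitmap_off = HEADER.size
            values_off = bitmap_off + (n + 7) // 8
            raw = struct.unpack_from("<%di" % n, mm, values_off)
            return [
                raw[i] if (mm[bitmap_off + i // 8] >> (i % 8)) & 1 else None
                for i in range(n)
            ]
        finally:
            mm.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "column.bin")
    write_int32_array(path, [1, None, 3])
    print(read_int32_array(path))
```

The reader needs the length and the type from somewhere, which is the role a
real schema/metadata description would play; the sketch simply hard-codes both.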

Another mode of transport that deserves a reference
specification/implementation is transferring tables via a socket (there is
already a JIRA open to create one via Unix domain sockets, but this should
likely be generalized to cover sockets in general).
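
As a rough illustration of what a socket-based transport has to settle, here is
a minimal Python sketch of length-prefixed framing. The 4-byte little-endian
length prefix and the names send_frame, recv_exact and recv_frame are
assumptions made for the example, not anything defined by the spec or the
existing JIRA. Nothing in it depends on the socket family, so the same code
would work over Unix domain or TCP sockets.

```python
import socket
import struct

# Hypothetical framing: a 4-byte little-endian length, then the payload bytes.
LENGTH_PREFIX = struct.Struct("<I")


def send_frame(sock, payload):
    """Send one frame: the length prefix followed by the payload."""
    sock.sendall(LENGTH_PREFIX.pack(len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv() may return fewer."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-frame")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_frame(sock):
    """Receive one frame written by send_frame."""
    (size,) = LENGTH_PREFIX.unpack(recv_exact(sock, LENGTH_PREFIX.size))
    return recv_exact(sock, size)


if __name__ == "__main__":
    # socketpair() gives two connected sockets inside one process.
    a, b = socket.socketpair()
    with a, b:
        send_frame(a, b"column bytes")
        print(recv_frame(b))
```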

I think we should open JIRAs to track writing reference specs for both shared 
memory and socket based transport.

Thanks,
-Micah

On Sat, Apr 9, 2016 at 7:55 PM, Zheng, Kai <kai.zh...@intel.com> wrote:
> Hi,
>
> Looking at the layout spec, I have some more questions to complement the
> previous ones discussed.
>
> About struct and union types:
>
> 1)      The order of the fields (the order in which they are declared) seems
> to be important, as it affects how the data are laid out. For example, in a
> union type, it determines how to organize and interpret the field types,
> offsets and data arrays. The same applies to the struct type.
>
> 2)      Nothing is said about how their schema is represented and
> how/where the schema is attached. Should the layout also contain the schema
> info? In the C++ code, Table is implemented as a set of columns with
> self-contained schema info.
>
> About schema:
> Nothing is mentioned about schema in the spec; I'm not sure if it should
> naturally be part of it. Without self-contained schema info, it won't be
> possible to interpret and process the layout data (like List, Struct, Table,
> Union, etc.) across machines and languages.
>
> About non-goals:
> The following are listed as non-goals for the document, but in fact they're
> going to be implemented. Should we remove them?
>
> 1.       To specify standardized metadata or a data layout for RPC or 
> transient file storage.
>
> 2.       Any "table" structure composed of named arrays each having their own 
> type or any other structure that composes arrays.
>
> Regards,
> Kai
>
