These IPC details we should definitely document outside of the code. For the String/Binary type question, I want to start a document that explains the logical data types in Message.fbs in terms of
- what Arrow memory layout they use (for example: Int32 uses fixed bit width 32 bits), and String uses List<UInt8> (with the restriction that the inner buffer must not have any nulls, and the validity bitmap is omitted) - any type-specific custom logic around deconstructing or reconstructing an Arrow container in an IPC/RPC setting. What we have been debating in this e-mail thread is altering the appearance of String/Binary representation in a record batch (https://github.com/apache/arrow/blob/master/format/Message.fbs#L147) from being identical to List<UInt8> (4 buffers in the flattened buffer list -- 2 for the list node, 2 for the UInt8 node) to its own "collapsed" form (3 buffers: bitmap, offsets, data). This means that any code that is sending/receiving a record batch will need separate code paths to handle List and String respectively (in the C++ code, we are currently using the same code path for both) For example, changes like ARROW-253 (https://github.com/apache/arrow/commit/dc01f099d966b92f4de7679b4a1caf97c363e08e) would be documented outside of the code and message IDL. I will open one or more JIRAs and write a patch to try to close the loop on this. - Wes On Tue, Aug 16, 2016 at 6:04 AM, Julien Le Dem <jul...@dremio.com> wrote: > There's ARROW-258 which is about clarifying difference (if any) in metadata > across RPC (sockets), IPC (shared memory) and files. > The vector layout is the same except in RPC or files they get concatenated > together when copied over. > The metadata should be mostly the same (ideally the same). Buffer offsets > are relative to the beginning of the body in the context of RPC and file > start in files. In the context of IPC it looks like we need an extra page id > (from Message.fbs). Is this correct? > > On Mon, Aug 15, 2016 at 12:01 PM, Micah Kornfield <emkornfi...@gmail.com> > wrote: >> >> Thanks Wes, >> This makes sense. +1 on the "Logical Types / IPC layout >> document" is there a JIRA open for this? >> >> I'll open a JIRA item to change the inheritance of string/binary in the >> C++ code base. >> >> Thanks, >> Micah >> >> On Sun, Aug 14, 2016 at 10:51 PM, Wes McKinney <wesmck...@gmail.com> >> wrote: >>> >>> On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield <emkornfi...@gmail.com> >>> wrote: >>> > Sorry for the late reply. >>> > >>> > This all sounds reasonable to me. But I'm not sure I understand >>> > exactly >>> > what you mean by >>> > >>> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string >>> >> would be a single array unit in the buffer stream and flattened Field >>> >> metadata rather than nested types (2 array units as they are >>> >> presently). >>> > >>> > >>> > The way I read it this seems to me to contradict the >>> > cross-implementation as >>> > "List<UInt8-not null>"? >>> > >>> > Thanks, >>> > Micah >>> > >>> >>> I think we can resolve this by starting a "Logical Types and IPC/RPC >>> layout" specification document. >>> >>> The schema metadata >>> (https://github.com/apache/arrow/blob/master/format/Message.fbs) is, >>> as I understand it, strictly the domain of logical types. I believe >>> there is some minor conflation of the notions of primitive physical >>> types and primitive logical types. >>> >>> While String / Binary have identical physical layouts to List<UInt8 >>> not null>, in the domain of logical types and IPC, what we are saying >>> is that these types are: >>> >>> - logically speaking: primitive, non-nested types >>> - their IPC layout is the flattened version of the nested List<UInt8> >>> counterpart -- a single Field node having String type (with a null >>> count, etc.), and 3 memory buffers: validity bitmap, offsets, and >>> data. Structurally on the wire / in shared memory (compared with >>> List<UInt8 not null>) the only difference is the Field metadata (since >>> if null count is 0 for the inner UInt8 values, then there is only a >>> single buffer) -- one node versus two >>> >>> Let me know if this does not make sense. >>> >>> To move this forward I propose to begin a Logical Types / IPC layout >>> document and begin to document the mapping between logical types and >>> their physical in-memory representation and layout on the wire. >>> >>> - Wes >> >> > > > > -- > Julien