Which I guess means the typeIds in the schema metadata should technically only be signed 8-bit values, because anything larger wouldn't be representable in the buffer.
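To make that concrete, here is a minimal Python sketch of the constraint, using the example values from Wes's message quoted below. This is not Arrow's API: `pack_type_ids` is a hypothetical helper, and treating the full signed range -128..127 as valid is my assumption (the thread only mentions staying under 127).

```python
import struct

# Assumption: the per-value type buffer stores signed 8-bit integers,
# so every code listed in the metadata has to fit in that range.
INT8_MIN, INT8_MAX = -128, 127


def pack_type_ids(type_ids, per_value_codes):
    """Pack one type code per value into a 1-byte-per-value buffer.

    type_ids: codes declared in the schema metadata (hypothetical input).
    per_value_codes: the code selected for each slot of the union.
    """
    for code in type_ids:
        if not INT8_MIN <= code <= INT8_MAX:
            raise ValueError(f"type id {code} does not fit in a signed 8-bit slot")

    allowed = set(type_ids)
    for code in per_value_codes:
        if code not in allowed:
            raise ValueError(f"code {code} is not declared in the metadata")

    # "b" is struct's format character for a signed char (1 byte).
    return struct.pack(f"{len(per_value_codes)}b", *per_value_codes)


# Values taken from the example in the quoted message.
metadata_type_ids = [0, 5, 10]
codes = [0, 5, 10, 10, 10, 10, 0, 5, 10, 0]
buf = pack_type_ids(metadata_type_ids, codes)
```

A metadata code such as 200 is rejected up front here, which is the point: the flatbuffer field may be typed as a 32-bit int, but only values that fit in one signed byte can ever appear in the buffer.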

On Thursday, March 21, 2019, Wes McKinney <[email protected]> wrote:

> Yes, the "typeIds" field in the metadata are the codes that correspond
> to each type; the actual data uses 1 byte per value
>
> So we might have something like
>
> typeIds: [0, 5, 10]
> typeIds buffer: [0, 5, 10, 10, 10, 10, 0, 5, 10, 0]
>
> Relatedly, we will have to start a new mailing list discussion about
> reconciling the Union format
>
> - Wes
>
> On Wed, Mar 20, 2019 at 3:49 AM Micah Kornfield <[email protected]>
> wrote:
> >
> > Hi Paul,
> > TL;DR; I think the the typeIds field you referenced is not the offset for
> > dense vectors mentioned by the spec. I believe (but lack the historical
> > context) that it is an outgrowth of the Java implementation that might be
> > useful in other contexts.
> >
> > The requirement is that typeIDs field you referenced is that has a less
> > length less the 127, the bit-width of the ID is immaterial. Also, the
> > typeIDs field and unions aren't fully supported yet. There is an open PR
> > [1] which got stalled on performance and long term direction concerns.
> >
> > I haven't fully validated this, but my rough understanding is that the
> > Java implementation assumes only one array/vector of each type is in a
> > union. Roughly, each logical type + Schema.fbs enum parameterization has
> > its own type with its own type ID (I think the number is still less 127
> > but might grow larger). The implementation makes use of this fact to do
> > some optimizations. So when a union (I think only Sparse is supported in
> > Java) serializes itself it records each of the type IDs [2] so it can
> > easily map back to them.
> >
> > [1] https://github.com/apache/arrow/pull/987
> > [2]
> > https://github.com/apache/arrow/blob/73d379f4631cd3013371f60876a52615171e6c3b/java/vector/src/main/codegen/templates/UnionVector.java#L329
> >
> > On Wed, Mar 20, 2019 at 1:08 AM Paul Taylor <[email protected]> wrote:
> >
> > > I noticed the the DenseUnion docs[1] says the typeIds buffer is 8-bit
> > > signed integers, but in the flatbuffer schema[2] it's typed as int (and
> > > flatc generates a function that returns an Int32Array).
> > >
> > > How are the other implementations treating this buffer, and should we
> > > update the docs or the flatbuffers schema?
> > >
> > > Thanks,
> > >
> > > Paul
> > >
> > > 1. https://arrow.apache.org/docs/format/Layout.html#dense-union-type
> > >
> > > 2.
> > > https://github.com/apache/arrow/blob/50bc9f49671afb56594910f49b9bf34e080a70e7/format/Schema.fbs#L92
