Which I guess means the typeIds in the schema metadata should technically
only be signed 8-bit values, because anything larger wouldn't be
representable in the buffer.
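
For what it's worth, here is a rough Python sketch of that point, using the
values from the example quoted below. The helper names (fits_in_int8,
unrepresentable_codes) are mine and purely illustrative; they are not taken
from any Arrow implementation. It only assumes what the thread states: the
metadata codes are declared as int, and the buffer stores one signed byte
per value.

```python
from array import array

# Metadata type codes from the quoted example (declared as int in Schema.fbs).
type_ids_metadata = [0, 5, 10]

# Per-value type ids buffer: typecode "b" is a signed 8-bit integer,
# so each slot occupies exactly one byte.
type_ids_buffer = array("b", [0, 5, 10, 10, 10, 10, 0, 5, 10, 0])


def fits_in_int8(code: int) -> bool:
    """True if the code can be stored in a signed 8-bit slot."""
    return -128 <= code <= 127


def unrepresentable_codes(codes):
    """Hypothetical check: metadata codes that could never appear in the buffer."""
    return [c for c in codes if not fits_in_int8(c)]


# Every value in the buffer is one of the codes listed in the metadata.
assert set(type_ids_buffer) <= set(type_ids_metadata)

# A 32-bit metadata code such as 300 has no representation in the buffer.
try:
    array("b", [300])
    stored = True
except OverflowError:
    stored = False
```

So a metadata code outside the signed 8-bit range is legal as far as the
flatbuffer type goes, but no buffer could ever refer to it.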


On Thursday, March 21, 2019, Wes McKinney <[email protected]> wrote:

> Yes, the "typeIds" field in the metadata holds the codes that correspond
> to each type; the actual data uses 1 byte per value.
>
> So we might have something like
>
> typeIds: [0, 5, 10]
> typeIds buffer: [0, 5, 10, 10, 10, 10, 0, 5, 10, 0]
>
> Relatedly, we will have to start a new mailing list discussion about
> reconciling the Union format
>
> - Wes
>
> On Wed, Mar 20, 2019 at 3:49 AM Micah Kornfield <[email protected]>
> wrote:
> >
> > Hi Paul,
> > TL;DR: I think the typeIds field you referenced is not the offset for
> > dense vectors mentioned by the spec.  I believe (but lack the historical
> > context) that it is an outgrowth of the Java implementation that might be
> > useful in other contexts.
> >
> > The requirement on the typeIDs field you referenced is that it has a
> > length less than 127; the bit-width of the ID is immaterial.  Also, the
> > typeIDs field and unions aren't fully supported yet.  There is an open PR
> > [1] which got stalled on performance and long term direction concerns.
> >
> > I haven't fully validated this, but my rough understanding is that the
> > Java implementation assumes only one array/vector of each type is in a
> > union.  Roughly, each logical type + Schema.fbs enum parameterization has
> > its own type with its own type ID (I think the number is still less than
> > 127 but might grow larger).  The implementation makes use of this fact to
> > do some optimizations.  So when a union (I think only Sparse is supported
> > in Java) serializes itself it records each of the type IDs [2] so it can
> > easily map back to them.
> >
> > [1] https://github.com/apache/arrow/pull/987
> > [2] https://github.com/apache/arrow/blob/73d379f4631cd3013371f60876a52615171e6c3b/java/vector/src/main/codegen/templates/UnionVector.java#L329
> >
> > On Wed, Mar 20, 2019 at 1:08 AM Paul Taylor <[email protected]> wrote:
> >
> > > I noticed that the DenseUnion docs[1] say the typeIds buffer is 8-bit
> > > signed integers, but in the flatbuffer schema[2] it's typed as int (and
> > > flatc generates a function that returns an Int32Array).
> > >
> > > How are the other implementations treating this buffer, and should we
> > > update the docs or the flatbuffers schema?
> > >
> > > Thanks,
> > >
> > > Paul
> > >
> > > 1. https://arrow.apache.org/docs/format/Layout.html#dense-union-type
> > >
> > > 2. https://github.com/apache/arrow/blob/50bc9f49671afb56594910f49b9bf34e080a70e7/format/Schema.fbs#L92
> > >
> > >
>
