An additional data-point, it looks like Apache Hive also uses one byte
for unions: 
https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryUnion.java

On Fri, Apr 8, 2016 at 8:21 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> I think one of Arrow's initial design goals should be simplicity of
> implementation of the spec.   We can always make things more
> complicated in the future.
>
> This leads me to prefer a fixed size.   Wes (or anyone else) in
> practice have you seen a union of structs with more then 127 members?
>
> I would vote for int8_t for the types array for unions and letting
> consumers of Arrow nest Unions at the application layer if they need
> more slots.
>
>
> On Fri, Apr 8, 2016 at 8:33 AM, Wes McKinney <w...@cloudera.com> wrote:
>> On Fri, Apr 8, 2016 at 8:07 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>>>>
>>>>
>>>> > I believe this choice was primarily about simplifying the code (similar
>>>> to why we have a n+1
>>>> > offsets instead of just n in the list/varchar representations (even
>>>> though n=0 is always 0)). In both
>>>> > situations, you don't have to worry about writing special code (and a
>>>> condition) for the boundary
>>>> > condition inside tight loops (e.g. the last few bytes need to be handled
>>>> differently since they
>>>> > aren't word width).
>>>>
>>>> Sounds reasonable.  It might be worth illustrating this with a
>>>> concrete example.  One scenario that this scheme seems useful for is a
>>>> creating a new bitmap based on evaluating a predicate (i.e. all
>>>> elements >X).  In this case would it make sense to make it a multiple
>>>> of 16, so we can consistently use SIMD instructions for the logical
>>>> "and" operation?
>>>>
>>>
>>> Hmm... interesting thought. I'd have to look but I also recall some of the
>>> newer stuff supporting even wider widths. What do others think?
>>>
>>>
>>>> I think the spec is slightly inconsistent.  It says there is 6 bytes
>>>> of overhead per entry but then follows: "with the smallest byte width
>>>> capable of representing the number of types in the union."  I'm
>>>> perfectly happy to say it is always 1, always 2, or always capped at
>>>> 2.  I agree 32K/64K+ types is a very unlikely scenario.  We just need
>>>> to clear up the ambiguity.
>>>>
>>>
>>> Agreed. Do you want to propose an approach & patch to clarify?
>>
>> I can also take responsibility for the ambiguity here. My preference
>> is to use int16_t for the types array (memory suitably aligned), but
>> as 1 byte will be sufficient nearly all of the time, it's a slight
>> trade-off in memory use vs. code complexity, e.g.
>>
>> if (children_.size() < 128) {
>>   // types is only 1 byte
>> } else {
>>   // types is 2 bytes
>> }
>>
>> Realistically there won't be that many affected code paths, so I'm
>> comfortable with either choice (2-bytes always, or 1 or 2 bytes
>> depending on the size of the union).

Reply via email to