I think one of Arrow's initial design goals should be simplicity of
implementation of the spec. We can always make things more
complicated in the future.

This leads me to prefer a fixed size. Wes (or anyone else), in
practice have you seen a union of structs with more than 127 members?

I would vote for int8_t for the types array for unions and letting
consumers of Arrow nest Unions at the application layer if they need
more slots.
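
To make the nesting idea concrete, here is a rough sketch (not Arrow's
API; kSlots, NestedTypeId, SplitTypeId and JoinTypeId are all
hypothetical names) of how an application could address more than 127
types with two levels of unions while the types array stays a fixed
int8_t. It assumes 127 usable slots per union, the figure used in this
thread.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: each union exposes 127 slots, addressed by a
// non-negative int8_t type id (0..126).
constexpr int kSlots = 127;

// Position of a logical type in a two-level nested union: `outer`
// picks the child union, `inner` picks the member inside it.
struct NestedTypeId {
  int8_t outer;
  int8_t inner;
};

// Split a flat application-level type id into two int8_t ids.
// Two levels cover kSlots * kSlots = 16129 distinct types.
inline NestedTypeId SplitTypeId(int flat_id) {
  assert(flat_id >= 0 && flat_id < kSlots * kSlots);
  return NestedTypeId{static_cast<int8_t>(flat_id / kSlots),
                      static_cast<int8_t>(flat_id % kSlots)};
}

// Inverse of SplitTypeId: recover the flat id from the nested pair.
inline int JoinTypeId(NestedTypeId id) {
  return static_cast<int>(id.outer) * kSlots + static_cast<int>(id.inner);
}
```

With this layout every reader handles exactly one types-array width, and
only applications that really need more slots pay for the extra level
of indirection.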


On Fri, Apr 8, 2016 at 8:33 AM, Wes McKinney <w...@cloudera.com> wrote:
> On Fri, Apr 8, 2016 at 8:07 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>>>
>>>
>>> > I believe this choice was primarily about simplifying the code (similar
>>> > to why we have n+1 offsets instead of just n in the list/varchar
>>> > representations (even though n=0 is always 0)). In both situations, you
>>> > don't have to worry about writing special code (and a condition) for the
>>> > boundary condition inside tight loops (e.g. the last few bytes need to be
>>> > handled differently since they aren't word width).
>>>
>>> Sounds reasonable.  It might be worth illustrating this with a
>>> concrete example.  One scenario that this scheme seems useful for is
>>> creating a new bitmap based on evaluating a predicate (i.e. all
>>> elements >X).  In this case would it make sense to make it a multiple
>>> of 16, so we can consistently use SIMD instructions for the logical
>>> "and" operation?
>>>
>>
>> Hmm... interesting thought. I'd have to look but I also recall some of the
>> newer stuff supporting even wider widths. What do others think?
>>
>>
>>> I think the spec is slightly inconsistent.  It says there are 6 bytes
>>> of overhead per entry but then follows: "with the smallest byte width
>>> capable of representing the number of types in the union."  I'm
>>> perfectly happy to say it is always 1, always 2, or always capped at
>>> 2.  I agree 32K/64K+ types is a very unlikely scenario.  We just need
>>> to clear up the ambiguity.
>>>
>>
>> Agreed. Do you want to propose an approach & patch to clarify?
>
> I can also take responsibility for the ambiguity here. My preference
> is to use int16_t for the types array (memory suitably aligned), but
> as 1 byte will be sufficient nearly all of the time, it's a slight
> trade-off in memory use vs. code complexity, e.g.
>
> if (children_.size() < 128) {
>   // types is only 1 byte
> } else {
>   // types is 2 bytes
> }
>
> Realistically there won't be that many affected code paths, so I'm
> comfortable with either choice (2-bytes always, or 1 or 2 bytes
> depending on the size of the union).
