I think one of Arrow's initial design goals should be simplicity of implementation of the spec. We can always make things more complicated in the future.
This leads me to prefer a fixed size. Wes (or anyone else), in practice have you seen a union of structs with more than 127 members? I would vote for int8_t for the types array for unions, and for letting consumers of Arrow nest Unions at the application layer if they need more slots.

On Fri, Apr 8, 2016 at 8:33 AM, Wes McKinney <w...@cloudera.com> wrote:
> On Fri, Apr 8, 2016 at 8:07 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>>>
>>> > I believe this choice was primarily about simplifying the code
>>> > (similar to why we have n+1 offsets instead of just n in the
>>> > list/varchar representations (even though n=0 is always 0)). In both
>>> > situations, you don't have to worry about writing special code (and a
>>> > condition) for the boundary condition inside tight loops (e.g. the
>>> > last few bytes need to be handled differently since they aren't word
>>> > width).
>>>
>>> Sounds reasonable. It might be worth illustrating this with a
>>> concrete example. One scenario that this scheme seems useful for is
>>> creating a new bitmap based on evaluating a predicate (i.e. all
>>> elements >X). In this case would it make sense to make it a multiple
>>> of 16, so we can consistently use SIMD instructions for the logical
>>> "and" operation?
>>>
>>
>> Hmm... interesting thought. I'd have to look but I also recall some of the
>> newer stuff supporting even wider widths. What do others think?
>>
>>
>>> I think the spec is slightly inconsistent. It says there are 6 bytes
>>> of overhead per entry but then follows: "with the smallest byte width
>>> capable of representing the number of types in the union." I'm
>>> perfectly happy to say it is always 1, always 2, or always capped at
>>> 2. I agree 32K/64K+ types is a very unlikely scenario. We just need
>>> to clear up the ambiguity.
>>>
>>
>> Agreed. Do you want to propose an approach & patch to clarify?
>
> I can also take responsibility for the ambiguity here.
> My preference is to use int16_t for the types array (memory suitably
> aligned), but as 1 byte will be sufficient nearly all of the time, it's a
> slight trade-off in memory use vs. code complexity, e.g.
>
>   if (children_.size() < 128) {
>     // types is only 1 byte
>   } else {
>     // types is 2 bytes
>   }
>
> Realistically there won't be that many affected code paths, so I'm
> comfortable with either choice (2-bytes always, or 1 or 2 bytes
> depending on the size of the union).
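
To make the trade-off discussed above concrete, here is a rough C++ sketch of the two options for the types array. None of these names come from the Arrow spec or codebase: VariableTypeIdWidth, FitsFixedWidth, TypesArrayBytes, CountPerChild and kMaxFixedWidthChildren are hypothetical helpers, and the limit of 127 children simply follows the figure used in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: byte width of one types-array entry under the
// "1 or 2 bytes depending on the size of the union" option.
inline std::size_t VariableTypeIdWidth(std::size_t num_children) {
  return num_children < 128 ? 1 : 2;
}

// Assumed limit for the fixed int8_t option, taken from the "127 members"
// figure in this thread rather than from any spec text.
constexpr std::size_t kMaxFixedWidthChildren = 127;

// True if a union with this many children fits the fixed int8_t option.
inline bool FitsFixedWidth(std::size_t num_children) {
  return num_children <= kMaxFixedWidthChildren;
}

// Bytes used by the types array for `length` union slots.
inline std::size_t TypesArrayBytes(std::size_t length, std::size_t id_width) {
  return length * id_width;
}

// With a fixed int8_t types array a reader needs only one code path:
// count how many slots refer to each child, with no branch on id width.
inline std::vector<std::size_t> CountPerChild(
    const std::vector<std::int8_t>& types, std::size_t num_children) {
  assert(FitsFixedWidth(num_children));
  std::vector<std::size_t> counts(num_children, 0);
  for (std::int8_t id : types) {
    assert(id >= 0 && static_cast<std::size_t>(id) < num_children);
    ++counts[static_cast<std::size_t>(id)];
  }
  return counts;
}
```

Under these assumptions, always using 2 bytes doubles the size of the types array compared with int8_t (TypesArrayBytes(1000, 2) versus TypesArrayBytes(1000, 1)), while the variable-width option keeps the smaller size but requires every reader like CountPerChild to exist in two variants or branch on the width.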