An additional data-point, it looks like Apache Hive also uses one byte for unions: https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryUnion.java
On Fri, Apr 8, 2016 at 8:21 PM, Micah Kornfield <emkornfi...@gmail.com> wrote: > I think one of Arrow's initial design goals should be simplicity of > implementation of the spec. We can always make things more > complicated in the future. > > This leads me to prefer a fixed size. Wes (or anyone else) in > practice have you seen a union of structs with more then 127 members? > > I would vote for int8_t for the types array for unions and letting > consumers of Arrow nest Unions at the application layer if they need > more slots. > > > On Fri, Apr 8, 2016 at 8:33 AM, Wes McKinney <w...@cloudera.com> wrote: >> On Fri, Apr 8, 2016 at 8:07 AM, Jacques Nadeau <jacq...@apache.org> wrote: >>>> >>>> >>>> > I believe this choice was primarily about simplifying the code (similar >>>> to why we have a n+1 >>>> > offsets instead of just n in the list/varchar representations (even >>>> though n=0 is always 0)). In both >>>> > situations, you don't have to worry about writing special code (and a >>>> condition) for the boundary >>>> > condition inside tight loops (e.g. the last few bytes need to be handled >>>> differently since they >>>> > aren't word width). >>>> >>>> Sounds reasonable. It might be worth illustrating this with a >>>> concrete example. One scenario that this scheme seems useful for is a >>>> creating a new bitmap based on evaluating a predicate (i.e. all >>>> elements >X). In this case would it make sense to make it a multiple >>>> of 16, so we can consistently use SIMD instructions for the logical >>>> "and" operation? >>>> >>> >>> Hmm... interesting thought. I'd have to look but I also recall some of the >>> newer stuff supporting even wider widths. What do others think? >>> >>> >>>> I think the spec is slightly inconsistent. It says there is 6 bytes >>>> of overhead per entry but then follows: "with the smallest byte width >>>> capable of representing the number of types in the union." I'm >>>> perfectly happy to say it is always 1, always 2, or always capped at >>>> 2. I agree 32K/64K+ types is a very unlikely scenario. We just need >>>> to clear up the ambiguity. >>>> >>> >>> Agreed. Do you want to propose an approach & patch to clarify? >> >> I can also take responsibility for the ambiguity here. My preference >> is to use int16_t for the types array (memory suitably aligned), but >> as 1 byte will be sufficient nearly all of the time, it's a slight >> trade-off in memory use vs. code complexity, e.g. >> >> if (children_.size() < 128) { >> // types is only 1 byte >> } else { >> // types is 2 bytes >> } >> >> Realistically there won't be that many affected code paths, so I'm >> comfortable with either choice (2-bytes always, or 1 or 2 bytes >> depending on the size of the union).