hi Antoine, It's a good question.
The intent when we wrote the specification was to be strictly monotonic, but there seems nothing especially harmful about relaxing the constraint to allow for repeated values or even non-monotonicity (strict or otherwise). For example, if we had the union ['a', 'a', 'a', 0, 1, 'b', 'b'] then this could be represented as type_ids: [0, 0, 0, 1, 1, 0, 0] offsets: [0, 0, 0, 0, 1, 1, 1] child[0]: ['a', 'b'] child[1]: [0, 1] or type_ids: [0, 0, 0, 1, 1, 0, 0] offsets: [1, 1, 1, 0, 1, 0, 0] child[0]: ['b', 'a'] child[1]: [0, 1] What do others think? Either way some clarification in the specification would be useful. Because the code used to do random access is the same in all cases, I feel weakly supportive of removing constraints on the offsets. - Wes On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou <anto...@python.org> wrote: > > > Hello, > > I'd like some clarification on the spec and intent for dense arrays. > > Currently, it is specified that offsets of a dense union are "in order / > increasing" (*). However, it is not obvious whether repeated values are > allowed or not. > > I suspect the intent is to avoid having people exploit unions as some > kind of poor man's dictionaries. Also, perhaps some optimizations are > possible if monotonic or strictly monotonic indices are assumed? But I > don't know the history behind the union type. > > Regards > > Antoine. > > > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union