Hmm, I also thought the intention was monotonically increasing. I can't think of a strong reason one way or another. If the argument about code to do random access is the same in all cases, is there any benefit to forcing any order at all? Memory prefetching?
On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney <[email protected]> wrote: > hi Antoine, > > It's a good question. > > The intent when we wrote the specification was to be strictly > monotonic, but there seems nothing especially harmful about relaxing > the constraint to allow for repeated values or even non-monotonicity > (strict or otherwise). For example, if we had the union > > ['a', 'a', 'a', 0, 1, 'b', 'b'] > > then this could be represented as > > type_ids: [0, 0, 0, 1, 1, 0, 0] > offsets: [0, 0, 0, 0, 1, 1, 1] > child[0]: ['a', 'b'] > child[1]: [0, 1] > > or > > type_ids: [0, 0, 0, 1, 1, 0, 0] > offsets: [1, 1, 1, 0, 1, 0, 0] > child[0]: ['b', 'a'] > child[1]: [0, 1] > > What do others think? Either way some clarification in the > specification would be useful. Because the code used to do random > access is the same in all cases, I feel weakly supportive of removing > constraints on the offsets. > > - Wes > > On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou <[email protected]> wrote: > > > > > > Hello, > > > > I'd like some clarification on the spec and intent for dense arrays. > > > > Currently, it is specified that offsets of a dense union are "in order / > > increasing" (*). However, it is not obvious whether repeated values are > > allowed or not. > > > > I suspect the intent is to avoid having people exploit unions as some > > kind of poor man's dictionaries. Also, perhaps some optimizations are > > possible if monotonic or strictly monotonic indices are assumed? But I > > don't know the history behind the union type. > > > > Regards > > > > Antoine. > > > > > > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union >
