Hi Antoine,

It's a good question.

The intent when we wrote the specification was for the offsets to be
strictly monotonic, but there seems to be nothing especially harmful
about relaxing the constraint to allow repeated values or even
non-monotonic offsets. For example, if we had the union

['a', 'a', 'a', 0, 1, 'b', 'b']

then this could be represented as

type_ids: [0, 0, 0, 1, 1, 0, 0]
offsets: [0, 0, 0, 0, 1, 1, 1]
child[0]: ['a', 'b']
child[1]: [0, 1]

or

type_ids: [0, 0, 0, 1, 1, 0, 0]
offsets: [1, 1, 1, 0, 1, 0, 0]
child[0]: ['b', 'a']
child[1]: [0, 1]
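
As a rough check that both layouts describe the same logical array, here
is a minimal sketch in plain Python. It does not use any Arrow library;
the helper names union_get and union_to_list are hypothetical and exist
only for this illustration.

```python
def union_get(type_ids, offsets, children, i):
    """Return the logical value at slot i of a dense union.

    type_ids[i] selects the child array and offsets[i] is the position
    inside that child. Nothing here depends on the ordering of offsets.
    """
    return children[type_ids[i]][offsets[i]]


def union_to_list(type_ids, offsets, children):
    """Materialize every slot of the union (hypothetical helper)."""
    return [union_get(type_ids, offsets, children, i)
            for i in range(len(type_ids))]


type_ids = [0, 0, 0, 1, 1, 0, 0]

# First layout: offsets into child 0 repeat but never decrease.
layout_1 = union_to_list(type_ids,
                         [0, 0, 0, 0, 1, 1, 1],
                         [['a', 'b'], [0, 1]])

# Second layout: offsets into child 0 go from 1 back down to 0.
layout_2 = union_to_list(type_ids,
                         [1, 1, 1, 0, 1, 0, 0],
                         [['b', 'a'], [0, 1]])

print(layout_1)
print(layout_2)
```

Both calls go through the same lookup, and both yield the union shown
at the top of the example.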

What do others think? Either way some clarification in the
specification would be useful. Because the code used to do random
access is the same in all cases, I feel weakly supportive of removing
constraints on the offsets.
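
To make the three possible readings of the constraint concrete, here is
a small sketch that classifies the offsets of each child, taken in slot
order. It assumes the ordering requirement applies per child rather
than to the offsets buffer as a whole; classify_offsets is a
hypothetical helper, not an Arrow API.

```python
def classify_offsets(type_ids, offsets):
    """Classify each child's offsets as they appear in slot order.

    Returns a dict mapping type id to one of "strictly increasing",
    "non-decreasing" (repeats allowed), or "unordered".
    """
    # Group offsets by the child they point into, preserving slot order.
    per_child = {}
    for t, o in zip(type_ids, offsets):
        per_child.setdefault(t, []).append(o)

    result = {}
    for t, seq in per_child.items():
        pairs = list(zip(seq, seq[1:]))  # consecutive offsets for this child
        if all(a < b for a, b in pairs):
            result[t] = "strictly increasing"
        elif all(a <= b for a, b in pairs):
            result[t] = "non-decreasing"
        else:
            result[t] = "unordered"
    return result


example_type_ids = [0, 0, 0, 1, 1, 0, 0]

# Repeated offsets into child 0, never decreasing.
repeated = classify_offsets(example_type_ids, [0, 0, 0, 0, 1, 1, 1])

# Offsets into child 0 decrease from 1 to 0.
reordered = classify_offsets(example_type_ids, [1, 1, 1, 0, 1, 0, 0])

print(repeated)
print(reordered)
```

Under the strict reading neither set of offsets would be valid, under a
non-decreasing reading only the first would be, and with no constraint
both are.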

- Wes

On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Hello,
>
> I'd like some clarification on the spec and intent for dense arrays.
>
> Currently, it is specified that offsets of a dense union are "in order /
> increasing" (*).  However, it is not obvious whether repeated values are
> allowed or not.
>
> I suspect the intent is to avoid having people exploit unions as some
> kind of poor man's dictionaries.  Also, perhaps some optimizations are
> possible if monotonic or strictly monotonic indices are assumed?  But I
> don't know the history behind the union type.
>
> Regards
>
> Antoine.
>
>
> (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union
