Hello,

The format spec and the C++ implementation disagree on one point:

* The spec says that dense union offsets should be increasing:
"""The respective offsets for each child value array must be in order /
increasing."""

(from https://arrow.apache.org/docs/format/Columnar.html#dense-union)

* The C++ implementation has long had some tests that deliberately use
non-increasing (even descending) dense union offsets.

(see https://issues.apache.org/jira/browse/ARROW-10580)

I don't know what other implementations, especially Java, expect.

There are obviously two possible solutions:

1) Fix the C++ implementation and its tests to conform to the format
spec (which may break compatibility for code producing / consuming dense
unions with non-increasing offsets)

2) Relax the format spec to allow arbitrary offsets (which could make
dense union more like a polymorphic dictionary).
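
To illustrate the "polymorphic dictionary" reading, here is a minimal
sketch in plain Python. It is a toy model and not the Arrow API: the
function name dense_union_get and the dict-of-lists representation of
the children are assumptions made up for this example.

```python
# Toy model of a dense union (hypothetical, not the Arrow API):
# slot i holds the value children[type_ids[i]][offsets[i]].
def dense_union_get(type_ids, offsets, children, i):
    """Return the logical value stored at slot i."""
    return children[type_ids[i]][offsets[i]]


children = {0: [1, 2, 3], 1: ["a", "b"]}
type_ids = [0, 1, 0, 1, 0]

# Offsets increasing within each child, as the spec currently requires.
ordered = [0, 0, 1, 1, 2]
# Arbitrary offsets: child values are reused and reordered, much like
# dictionary indices pointing into a dictionary of values.
arbitrary = [2, 1, 2, 0, 0]

ordered_values = [
    dense_union_get(type_ids, ordered, children, i) for i in range(len(type_ids))
]
arbitrary_values = [
    dense_union_get(type_ids, arbitrary, children, i) for i in range(len(type_ids))
]
```

Both offset arrays resolve without trouble in this model; only the
first one satisfies the current wording of the spec.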

If the first solution is chosen, then another question arises: must the
offsets be strictly increasing?  Or can a given offset appear several
times in a row?
(the latter is currently exploited by the C++ implementation: when
appending several nulls to a DenseUnionBuilder, only one child null slot
is added and the same offset is appended multiple times)
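
The three possible rules can be told apart with a small check. Again
this is only a sketch in plain Python under my own assumptions: the
name classify_offsets and the three labels are hypothetical and do not
correspond to anything in the C++ validation code.

```python
from collections import defaultdict


def classify_offsets(type_ids, offsets):
    """Classify the offsets of each child, taken in slot order."""
    per_child = defaultdict(list)
    for type_id, offset in zip(type_ids, offsets):
        per_child[type_id].append(offset)

    result = {}
    for type_id, child_offsets in per_child.items():
        pairs = list(zip(child_offsets, child_offsets[1:]))
        if all(a < b for a, b in pairs):
            result[type_id] = "strictly increasing"
        elif all(a <= b for a, b in pairs):
            result[type_id] = "non-decreasing"
        else:
            result[type_id] = "unordered"
    return result


# Child 1 receives three consecutive nulls that share a single child
# slot, which is the builder behaviour described above.
repeated = classify_offsets([0, 1, 1, 1, 0], [0, 0, 0, 0, 1])

# Descending offsets, as in the tests mentioned earlier.
descending = classify_offsets([0, 0, 0], [2, 1, 0])
```

Under a strict rule only the "strictly increasing" case would be valid;
a non-strict rule would also accept "non-decreasing"; the relaxed spec
of solution 2 would accept all three.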

Regards

Antoine.
