I think the Java implementation does not conform to the spec either. IMO, option 2 offers more opportunities for performance optimization. However, it may lead to some unexpected behavior: because several slots can then share the same child offset, changing the value of one slot may change the values of several other slots as well.
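To make the aliasing concern concrete, here is a minimal sketch in plain Python. It is only a toy model of the dense union layout, not the Arrow API: `union_values` is a hypothetical helper, and the child arrays are ordinary lists.

```python
# Toy model of a dense union (NOT the Arrow API; names are hypothetical).
# type_ids[i] selects the child array, offsets[i] selects the slot in that child.
def union_values(type_ids, offsets, children):
    """Materialize the logical values of the union, one per slot."""
    return [children[t][o] for t, o in zip(type_ids, offsets)]

children = {0: [10, 20], 1: ["a"]}
type_ids = [0, 1, 0, 0]
# Offsets into child 0 are 1, 0, 1: not increasing, and offset 1 is reused.
offsets = [1, 0, 0, 1]

before = union_values(type_ids, offsets, children)

# Overwrite a single child slot; every union slot pointing at it changes.
children[0][1] = 99
after = union_values(type_ids, offsets, children)

print(before)  # [20, 'a', 10, 20]
print(after)   # [99, 'a', 10, 99]
```

Slots 0 and 3 both change after one write, because they share the same child slot. If the offsets for each child were strictly increasing, every child slot would be referenced at most once and this could not happen.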
In general, I prefer option 2.

Best,
Liya Fan

On Tue, Nov 17, 2020 at 11:37 PM Wes McKinney <wesmck...@gmail.com> wrote:

> In principle I'm in favor of #2 -- the only question is what kinds of
> problems it might pose for forward compatibility.
>
> Note
>
> * This is completely backward compatible (any data conforming to the
> spec to the letter will continue to be conforming)
> * It is also forward compatible at a protocol level, but code that
> makes assumptions about the monotonicity of the offsets will break
>
> Since the offset acts effectively as a dictionary index, this doesn't
> strike me as being so harmful, but I'm interested in the opinions of
> others
>
> On Tue, Nov 17, 2020 at 5:28 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >
> > Hello,
> >
> > The format spec and the C++ implementation disagree on one point:
> >
> > * The spec says that dense union offsets should be increasing:
> > """The respective offsets for each child value array must be in order /
> > increasing."""
> >
> > (from https://arrow.apache.org/docs/format/Columnar.html#dense-union)
> >
> > * The C++ implementation has long had some tests that used deliberately
> > non-increasing (even descending) dense union offsets.
> >
> > (see https://issues.apache.org/jira/browse/ARROW-10580)
> >
> > I don't know what other implementations, especially Java, expect.
> >
> > There are obviously two possible solutions:
> >
> > 1) Fix the C++ implementation and its tests to conform to the format
> > spec (which may break compatibility for code producing / consuming dense
> > unions with non-increasing offsets)
> >
> > 2) Relax the format spec to allow arbitrary offsets (which could make
> > dense union more like a polymorphic dictionary).
> >
> > If the first solution is chosen, then another question arises: must the
> > offsets be strictly increasing? Or can a given offset appear several
> > times in a row?
> > (the latter is currently exploited by the C++ implementation: when
> > appending several nulls to a DenseUnionBuilder, only one child null slot
> > is added and the same offset is appended multiple times)
> >
> > Regards
> >
> > Antoine.
>
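For reference, the two readings of "increasing" raised in the quoted message (strictly increasing versus allowing a repeated offset) could be checked roughly as follows. This is only a sketch in plain Python; `offsets_in_order` is a hypothetical helper, not a function from any Arrow implementation.

```python
# Hypothetical validator, NOT part of any Arrow library.
def offsets_in_order(type_ids, offsets, strict=False):
    """Return True if, for each child, its offsets appear in increasing order.

    With strict=False a given offset may repeat (as when several nulls
    share one child slot); with strict=True repeats are rejected too.
    """
    last = {}  # most recent offset seen for each child type id
    for t, o in zip(type_ids, offsets):
        if t in last:
            prev = last[t]
            if o < prev or (strict and o == prev):
                return False
        last[t] = o
    return True

# Repeated offset for child 0: allowed under the relaxed reading only.
print(offsets_in_order([0, 0, 0], [0, 0, 1]))               # True
print(offsets_in_order([0, 0, 0], [0, 0, 1], strict=True))  # False
# Descending offsets for child 0: rejected under both readings.
print(offsets_in_order([0, 0], [1, 0]))                     # False
```

Under option 2 no such check would be needed at all; under option 1 the `strict` flag marks exactly the open question about repeated offsets.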