The last time this was discussed [1], I think we determined that the specification was written as intended, and Wes mentioned there that he was also weakly supportive of removing the constraint.
From a previous discussion [2], it sounds like users of the JS library were explicitly using the "dictionary" feature (i.e. not necessarily conformant with the spec).

[1] https://lists.apache.org/thread.html/71e0bb9d4f57ccf2092cab8af5827f24c3cee87b02b822fc1016c4b3%40%3Cdev.arrow.apache.org%3E
[2] https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee9ec022d581cecb6cf0eb6a8f36%40%3Cdev.arrow.apache.org%3E

On Thu, Nov 19, 2020 at 12:58 AM Fan Liya <liya.fa...@gmail.com> wrote:

> I think the Java implementation is not aligning with the spec, either.
>
> IMO, option 2 provides more performance optimization opportunities.
> However, it may lead to some unexpected behaviors. For example, when we
> change the value of one slot, the values of several other slots may be
> changed as well.
>
> In general, I prefer option 2.
>
> Best,
> Liya Fan
>
> On Tue, Nov 17, 2020 at 11:37 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > In principle I'm in favor of #2 -- the only question is what kinds of
> > problems it might pose for forward compatibility.
> >
> > Note
> >
> > * This is completely backward compatible (any data conforming to the
> > spec to the letter will continue to be conforming)
> > * It is also forward compatible at a protocol level, but code that
> > makes assumptions about the monotonicity of the offsets will break
> >
> > Since the offset acts effectively as a dictionary index, this doesn't
> > strike me as being so harmful, but I'm interested in the opinions of
> > others
> >
> > On Tue, Nov 17, 2020 at 5:28 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > Hello,
> > >
> > > The format spec and the C++ implementation disagree on one point:
> > >
> > > * The spec says that dense union offsets should be increasing:
> > > """The respective offsets for each child value array must be in order /
> > > increasing."""
> > >
> > > (from https://arrow.apache.org/docs/format/Columnar.html#dense-union)
> > >
> > > * The C++ implementation has long had some tests that used deliberately
> > > non-increasing (even descending) dense union offsets.
> > >
> > > (see https://issues.apache.org/jira/browse/ARROW-10580)
> > >
> > > I don't know what other implementations, especially Java, expect.
> > >
> > > There are obviously two possible solutions:
> > >
> > > 1) Fix the C++ implementation and its tests to conform to the format
> > > spec (which may break compatibility for code producing / consuming dense
> > > unions with non-increasing offsets)
> > >
> > > 2) Relax the format spec to allow arbitrary offsets (which could make
> > > dense union more like a polymorphic dictionary).
> > >
> > > If the first solution is chosen, then another question arises: must the
> > > offsets be strictly increasing? Or can a given offset appear several
> > > times in a row?
> > > (the latter is currently exploited by the C++ implementation: when
> > > appending several nulls to a DenseUnionBuilder, only one child null slot
> > > is added and the same offset is appended multiple times)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
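
To make the two options in the thread concrete, here is a minimal pure-Python sketch of how a dense union slot is resolved through its type id and offset, plus a check for the per-child ordering constraint quoted from the spec. This is not Arrow's API: the function names, the dict-of-lists layout for the children, and the sample data are all assumptions made up for illustration.

```python
def resolve_dense_union(type_ids, offsets, children):
    """Slot i reads children[type_ids[i]][offsets[i]] (hypothetical helper)."""
    return [children[t][o] for t, o in zip(type_ids, offsets)]


def offsets_in_order(type_ids, offsets, strict=False):
    """Check that the offsets seen for each child never decrease.

    With strict=True, a repeated offset for the same child also fails.
    (Hypothetical helper, not part of any Arrow implementation.)
    """
    last_seen = {}  # child type id -> last offset seen for that child
    for t, o in zip(type_ids, offsets):
        prev = last_seen.get(t)
        if prev is not None and (o < prev or (strict and o == prev)):
            return False
        last_seen[t] = o
    return True


# Two child arrays: child 0 holds floats, child 1 holds strings.
children = {0: [1.5, 2.5], 1: ["a", "b"]}
type_ids = [0, 1, 0, 1]

conforming = [0, 0, 1, 1]  # increasing per child, as the spec text requires
descending = [1, 1, 0, 0]  # arbitrary order, as option 2 would allow
repeated = [0, 0, 0, 1]    # child 0's slot 0 is referenced twice

print(resolve_dense_union(type_ids, conforming, children))
print(resolve_dense_union(type_ids, descending, children))
print(resolve_dense_union(type_ids, repeated, children))
```

All three offset buffers resolve without trouble, which is the sense in which the offset behaves like a dictionary index. Only `conforming` passes `offsets_in_order(..., strict=True)`; `repeated` passes the non-strict check, and `descending` fails both. The `repeated` case also shows the aliasing Liya Fan mentions: changing `children[0][0]` would change two logical slots at once.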