Last time this was discussed [1], I think we determined that the
specification was written as intended, and Wes mentioned there that he was
also weakly supportive of removing the constraint.

From a previous discussion [2], it sounds like users of the JS library were
explicitly using the "dictionary" feature (i.e. in a way that is not
necessarily conformant with the spec).
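
To make the "dictionary" usage concrete, here is a minimal sketch in plain
Python. It does not use any Arrow library; the function names and the sample
data are hypothetical and only model the idea that each slot carries a type
id and an offset into the matching child array. It shows offsets that are
not increasing per child, and how two slots sharing one offset see the same
child value.

```python
# Hypothetical, library-free model of a dense union (not the Arrow API):
# slot i selects child `type_ids[i]` and reads that child at `offsets[i]`.

def dense_union_values(type_ids, offsets, children):
    """Resolve every slot to the child value it points at."""
    return [children[t][o] for t, o in zip(type_ids, offsets)]


def offsets_increasing_per_child(type_ids, offsets, strict=False):
    """Check that, for each child, offsets appear in increasing order.

    With strict=False a repeated offset is accepted; with strict=True
    it is rejected.
    """
    last = {}
    for t, o in zip(type_ids, offsets):
        prev = last.get(t)
        if prev is not None and (o < prev or (strict and o == prev)):
            return False
        last[t] = o
    return True


# Sample data (made up for illustration).
children = {0: ["a", "b"], 1: [1.5, 2.5]}
type_ids = [0, 1, 0, 0, 1]
offsets = [1, 0, 0, 1, 0]  # child 0 is read at 1, 0, 1; child 1 at 0, 0

values = dense_union_values(type_ids, offsets, children)
conforming = offsets_increasing_per_child(type_ids, offsets)

# Slots 0 and 3 share child 0 / offset 1, so one write changes both.
children[0][1] = "z"
values_after_write = dense_union_values(type_ids, offsets, children)
```

With arbitrary offsets the child arrays behave like per-type dictionaries,
which is why code assuming monotonic offsets would need a check like the
one above before relying on that property.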

[1]
https://lists.apache.org/thread.html/71e0bb9d4f57ccf2092cab8af5827f24c3cee87b02b822fc1016c4b3%40%3Cdev.arrow.apache.org%3E

[2]
https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee9ec022d581cecb6cf0eb6a8f36%40%3Cdev.arrow.apache.org%3E




On Thu, Nov 19, 2020 at 12:58 AM Fan Liya <liya.fa...@gmail.com> wrote:

> I think the Java implementation does not align with the spec either.
>
> IMO, option 2 provides more performance optimization opportunities.
> However, it may lead to some unexpected behaviors. For example, when we
> change the value of one slot, the values of several other slots may be
> changed as well.
>
> In general, I prefer option 2.
>
> Best,
> Liya Fan
>
>
>
> On Tue, Nov 17, 2020 at 11:37 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > In principle I'm in favor of #2 -- the only question is what kinds of
> > problems it might pose for forward compatibility.
> >
> > Note
> >
> > * This is completely backward compatible (any data conforming to the
> > spec to the letter will continue to be conforming)
> > * It is also forward compatible at a protocol level, but code that
> > makes assumptions about the monotonicity of the offsets will break
> >
> > Since the offset acts effectively as a dictionary index, this doesn't
> > strike me as being so harmful, but I'm interested in the opinions of
> > others
> >
> > On Tue, Nov 17, 2020 at 5:28 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > >
> > > Hello,
> > >
> > > The format spec and the C++ implementation disagree on one point:
> > >
> > > * The spec says that dense union offsets should be increasing:
> > > """The respective offsets for each child value array must be in order /
> > > increasing."""
> > >
> > > (from https://arrow.apache.org/docs/format/Columnar.html#dense-union)
> > >
> > > * The C++ implementation has long had some tests that used deliberately
> > > non-increasing (even descending) dense union offsets.
> > >
> > > (see https://issues.apache.org/jira/browse/ARROW-10580)
> > >
> > > I don't know what other implementations, especially Java, expect.
> > >
> > > There are obviously two possible solutions:
> > >
> > > 1) Fix the C++ implementation and its tests to conform to the format
> > > spec (which may break compatibility for code producing / consuming
> > > dense unions with non-increasing offsets)
> > >
> > > 2) Relax the format spec to allow arbitrary offsets (which could make
> > > dense union more like a polymorphic dictionary).
> > >
> > > If the first solution is chosen, then another question arises: must the
> > > offsets be strictly increasing?  Or can a given offset appear several
> > > times in a row?
> > > (the latter is currently exploited by the C++ implementation: when
> > > appending several nulls to a DenseUnionBuilder, only one child null
> > > slot is added and the same offset is appended multiple times)
> > >
> > > Regards
> > >
> > > Antoine.
> >
>
