Thanks @Antoine/@Weston - we've raised an issue [1] for the same in Arrow
Java as suggested.

Cheers,

James

[1]: https://github.com/apache/arrow/issues/40951

On Tue, 2 Apr 2024 at 14:29, Finn Völkel <f...@juxt.pro> wrote:

> @weston I think mentioning ADT was a mistake. I am just thinking of sum
> types (https://en.wikipedia.org/wiki/Tagged_union), and I should have
> called them that.
> You are thinking of a product type, which is better represented by a
> StructVector with nullable child vectors.
>
> @antoine Thanks for the clarification.
>
> On Tue, 2 Apr 2024 at 14:47, Weston Pace <weston.p...@gmail.com> wrote:
>
> > Wouldn't support for ADT require expressing more than 1 type id per
> > record?  In other words, if `put` has type id 1, `delete` has type id 2,
> > and `erase` has type id 3 then there is no way to express that something
> > is (for example) both type id 1 and type id 3 because you can only have
> > one type id per record.
> >
> > If that understanding is correct then it seems you can always encode
> > world 2 into world 1 by exhaustively listing out the combinations.  In
> > other words, `put` is the LSB, `delete` is bit 2, and `erase` is bit 3
> > and you have:
> >
> > 7 - put/delete/erase
> > 6 - delete/erase
> > 5 - erase/put
> > 4 - erase
> > 3 - put/delete
> > 2 - delete
> > 1 - put
> >
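
A minimal sketch of the encoding described above, in plain Python rather
than any Arrow API. The helper names (`combo_type_id`, `decode_type_id`,
`all_combinations`) are hypothetical and only illustrate the bit layout:
`put` is the lowest bit, `delete` the next, `erase` the third.

```python
from itertools import combinations

# Member names taken from the example in this thread; the order fixes the bit
# position (put -> bit 0, i.e. the LSB, delete -> bit 1, erase -> bit 2).
MEMBERS = ["put", "delete", "erase"]


def combo_type_id(names):
    """Encode a combination of member names as a single type id."""
    type_id = 0
    for name in names:
        type_id |= 1 << MEMBERS.index(name)
    return type_id


def decode_type_id(type_id):
    """Recover the member names whose bits are set in a type id."""
    return [name for bit, name in enumerate(MEMBERS) if type_id & (1 << bit)]


def all_combinations():
    """Exhaustively list every non-empty combination with its type id."""
    table = {}
    for size in range(1, len(MEMBERS) + 1):
        for combo in combinations(MEMBERS, size):
            table[combo_type_id(combo)] = list(combo)
    return table
```

With three members this yields the seven type ids listed above, so the
number of union children grows as 2^n - 1 with the number of members.
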
> > On Tue, Apr 2, 2024 at 4:36 AM Finn Völkel <f...@juxt.pro> wrote:
> >
> > > I also meant Algebraic Data Type, not Abstract Data Type (too many
> > > acronyms).
> > >
> > > On Tue, 2 Apr 2024 at 13:28, Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > >
> > > > Thanks. The Arrow spec does support multiple union members with the
> > > > same type, but not all implementations do. The C++ implementation
> > > > should support it, though to my surprise we do not seem to have any
> > > > tests for it.
> > > >
> > > > If the Java implementation doesn't, then you can probably open an
> > > > issue for it (and even submit a PR if you would like to tackle it).
> > > >
> > > > I've also opened https://github.com/apache/arrow/issues/40947 to
> > > > create integration tests for this.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On 02/04/2024 at 13:19, Finn Völkel wrote:
> > > > >> Can you explain what ADT means ?
> > > > >
> > > > > Sorry about that. ADT stands for Abstract Data Type. What do I mean
> > > > > by an ADT style vector?
> > > > >
> > > > > Let's take an example from the project I am on. We have an `op`
> > > > > union vector with three child vectors `put`, `delete`, `erase`.
> > > > > `delete` and `erase` have the same type but represent different
> > > > > things.
> > > > >
> > > > > On Tue, 2 Apr 2024 at 13:16, Steve Kim <chairm...@gmail.com> wrote:
> > > > >
> > > > >> Thank you for asking this question. I have the same question.
> > > > >>
> > > > >> I noted a similar problem in the C++/Python implementation:
> > > > >>
> > > > >> https://github.com/apache/arrow/issues/19157#issuecomment-1528037394
> > > > >>
> > > > >> On Tue, Apr 2, 2024, 04:30 Finn Völkel <f...@juxt.pro> wrote:
> > > > >>
> > > > >>> Hi,
> > > > >>>
> > > > >>> my question primarily concerns the union layout described at
> > > > >>> https://arrow.apache.org/docs/format/Columnar.html#union-layout
> > > > >>>
> > > > >>> There are two ways to use unions:
> > > > >>>
> > > > >>>     - polymorphic vectors (world 1)
> > > > >>>     - ADT style vectors (world 2)
> > > > >>>
> > > > >>> In world 1 you have a vector that stores different types. In the
> > > > >>> ADT world you could have multiple child vectors with the same type
> > > > >>> but different type ids in the union type vector. The difference is
> > > > >>> apparent if you want to use two BigIntVectors as children, which
> > > > >>> is not possible in world 1. World 1 is a subset of world 2.
> > > > >>>
> > > > >>> The spec (to my understanding) doesn’t explicitly forbid world 2,
> > > > >>> but the implementation we have been using (Java) has been making
> > > > >>> the assumption of being in world 1 (a union only having ONE child
> > > > >>> of each type). We sometimes use union in the ADT style which has
> > > > >>> led to problems down the road.
> > > > >>> Could someone clarify what the specification allows and what it
> > > > >>> doesn’t allow? Could we tighten the specification after that
> > > > >>> clarification?
> > > > >>>
> > > > >>> Best, Finn
> > > > >>>
> > > > >>
> > > > >
> > > >
> > >
> >
>


-- 
*James Henderson*
XTDB Head of Engineering at *JUXT*

Mobile +44 (0) 780 4321 777 <+447804321777>
Email j...@juxt.pro
Website https://juxt.pro
