On Tue, Jun 18, 2019 at 6:38 AM Ben Kietzman <ben.kietz...@rstudio.com> wrote:
>
> I don't understand the utility of this indirection and to me it seems more
> natural to remove Union.typeIds and state that the type buffer will always
> contain the index of the corresponding child array.
>
> One use case for Union.typeIds is simplification of dropping types from the
> union. However performing that operation on an array would still require
> scanning the types buffer for elements of the dropped types and adding them
> to a new null bitmap; it's not zero copy or allocation free and seems like
> a poor justification for the added complexity.
>

I can see the argument for removing (well, deprecating since removing
fields in Flatbuffers is bad) the typeIds metadata field, but it
creates a hard requirement that the union types are globally known to
all data producers. Otherwise a "unification" step can be required to
normalize the type ids. The intent of the type ids was to allow for
scenarios where unions can be combined or split without a data-rewrite
step; it's possible that these scenarios are too esoteric to burden
the average case, but unions are already a fairly specialized tool.

In any case, I'm +0 on taking any action, if there are others who have
strong opinions about it, we can have a dedicated discussion thread
about this issue in particular.

> If there is no interest in simplifying by removing this then I'll just
> create a JIRA to update the doccomments
>
> On Mon, Jun 17, 2019 at 5:20 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > example sparse union:
> >
> > types: (int64, utf8)
> > type_ids: [0, 4]
> >
> > type buffer: [0, 0, 0, 4, 4, 4]
> >
> > child 0: [1, 2, 3, --, --, --]
> > child 1: [--, --, --, 'foo', 'bar', 'baz']
> >
> > example dense union:
> >
> > types: (int64, utf8)
> > type_ids: [0, 4]
> >
> > type buffer: [0, 0, 0, 4, 4, 4]
> > offsets buffer: [0, 1, 2, 0, 1, 2]
> >
> > child 0: [1, 2, 3]
> > child 1: ['foo', 'bar', 'baz']
> >
> > On Mon, Jun 17, 2019 at 3:50 PM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > >
> > > Le 17/06/2019 à 22:46, Wes McKinney a écrit :
> > > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L88
> > > >
> > > > "optionally typeIds provides an indirection between the child offset
> > > > and the type id for each child typeIds[offset] is the id used in the
> > > > type vector"
> > >
> > > Does this mean typeIds[logical_id] gives the physical child index?
> > > or typeIds[physical_index] gives the logical type id?
> > >
> > > I don't understand what "offset" means in this context, usually it's a
> > > displacement from the beginning of an array.
> > >
> > >
> > > >
> > > > On Mon, Jun 17, 2019 at 12:26 PM Ben Kietzman <
> > ben.kietz...@rstudio.com> wrote:
> > > >>
> > > >> Somewhat related:
> > > >>
> > > >> Could we clarify the expected content of the type_ids buffer of union
> > > >> arrays? Layout.rst
> > > >> <
> > https://github.com/apache/arrow/blob/dede1e6/docs/source/format/Layout.rst#dense-union-type
> > >
> > > >> seems to indicate these should be indices of the corresponding child
> > array,
> > > >> but the C++ implementation allows them to be any positive int8 and
> > > >> maintains a mapping to child indices. (for example see what is
> > generated
> > > >> for integration testing
> > > >> <
> > https://github.com/apache/arrow/blob/dede1e6/cpp/src/arrow/ipc/test-common.cc#L410
> > >:
> > > >> has two child arrays and type_ids 5, 10)
> > > >>
> > > >> On Mon, Jun 17, 2019 at 11:35 AM Micah Kornfield <
> > emkornfi...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Sounds good.  Sorry I got distracted with some other stuff but
> > should be
> > > >>> getting back to this soonish
> > > >>>
> > > >>> On Monday, June 17, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
> > > >>>
> > > >>>> I'd already moved the Union issues to 1.0.0 so we are all good there
> > > >>>>
> > > >>>> On Mon, Jun 17, 2019 at 10:18 AM Wes McKinney <wesmck...@gmail.com>
> > > >>> wrote:
> > > >>>>>
> > > >>>>> I'm also +1 for generalized unions as we currently have specified.
> > The
> > > >>>>> objections from the Java users seems to be mostly on the basis of
> > > >>>>> performance in the union-of-primitives case -- that's an
> > > >>>>> implementation specific issue, so if Java needs to have a
> > > >>>>> "GeneralizedDenseUnionVector" or something to handle the
> > > >>>>> union-of-anything case, then that seems reasonable. The important
> > > >>>>> thing is that the binary protocol itself and serialized metadata is
> > > >>>>> something that we are happy with and won't need to change going
> > > >>>>> forward.
> > > >>>>>
> > > >>>>> It seems we're getting a bit long in the tooth to get this into
> > 0.14.0
> > > >>>>> so I'm going to move the Union-related issues to the 1.0.0
> > milestone
> > > >>>>> so we can get this resolved for the 1.0.0 release
> > > >>>>>
> > > >>>>> Thanks
> > > >>>>> Wes
> > > >>>>>
> > > >>>>> On Mon, Jun 10, 2019 at 12:33 AM Ravindra Pindikura <
> > > >>> ravin...@dremio.com>
> > > >>>> wrote:
> > > >>>>>>
> > > >>>>>> On Sat, May 25, 2019 at 12:29 PM Micah Kornfield <
> > > >>>> emkornfi...@gmail.com>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Thanks for the responses, I've clipped the questions and provided
> > > >>>> responses
> > > >>>>>>> inline.
> > > >>>>>>>
> > > >>>>>>> is the proposal that both cpp & java will support only option 2 ?
> > > >>>>>>>> I guess 1 is a subset of 2 anyway.
> > > >>>>>>>
> > > >>>>>>> CPP already supports option 2.  I would like to make CPP and java
> > > >>>>>>> compatible, in a way that this acceptable for Java maintainers.
> > > >>>> Yes, 1 is
> > > >>>>>>> a subset of 2.
> > > >>>>>>>
> > > >>>>>>> The metadata on java side uses the minor type id as the type
> > > >>>> identifier in
> > > >>>>>>>> the union (and the field name is expected to the the same as the
> > > >>>> type
> > > >>>>>>> name
> > > >>>>>>>> in a union). If you were to support a generalized union, this
> > > >>>> wouldn't
> > > >>>>>>>> work. How will the type identifiers be generated ?
> > > >>>>>>>> I'm trying to see if we can make the change backward compatible,
> > > >>>> with
> > > >>>>>>>> existing unions in java.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Looking at the code, I don't think the existing Union class is
> > > >>>>>>> generalizable because of this assumption (it caches a single type
> > > >>> of
> > > >>>> each
> > > >>>>>>> locally) and based on the Javadoc this seems to be for
> > performance
> > > >>>> reasons,
> > > >>>>>>> so I would like to try to avoid touching it if possible.
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>> @Micah Kornfield <emkornfi...@gmail.com> sorry, I wasn't clear. I
> > > >>>> meant to
> > > >>>>>> ask if the format would be backward compatible, which I think it
> > will
> > > >>>> be
> > > >>>>>> (since 1 is a subset of 2, and your proposal isn't making a
> > change in
> > > >>>> the
> > > >>>>>> wire format).
> > > >>>>>>
> > > >>>>>> I'm fine if the APIs are not backward compatible. Or, once we
> > have 2,
> > > >>>> we
> > > >>>>>> can add wrappers for 1, if required.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>>
> > > >>>>>>> My original thinking was to try to factor out a common base class
> > > >>>> from
> > > >>>>>>> UnionVector, then create a new GeneralizedUnionVector class that
> > > >>> has
> > > >>>>>>> slightly different method signatures (its possible I will need
> > > >>>> additional
> > > >>>>>>> supporting classes like a new GeneralizedUnionWriter, but i
> > haven't
> > > >>>> gotten
> > > >>>>>>> that far yet).  The main challenge I see is a way to let users
> > > >>> switch
> > > >>>>>>> between the two implementations.  Some thoughts off the top of my
> > > >>>> head
> > > >>>>>>> (none of them seem good):
> > > >>>>>>> 1.  Create a flag like:
> > > >>>>>>>
> > > >>>>>>>
> > > >>> https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2e
> > > >>>> f465d331fc/java/memory/src/main/java/org/apache/arrow/
> > > >>>> memory/BoundsChecking.java
> > > >>>>>>> so
> > > >>>>>>> it is statically decided before hand, and have the new class
> > > >>>> implement the
> > > >>>>>>> same signatures as UnionVector to but throw an exception if a
> > > >>> method
> > > >>>> that
> > > >>>>>>> isn't compatible is called.
> > > >>>>>>> 2. Possibly try to augment ArrowType to pass through information
> > > >>>> about its
> > > >>>>>>> children vectors when reading vectors, but use the flag in
> > option 1
> > > >>>> if it
> > > >>>>>>> can't be determined.
> > > >>>>>>>
> > > >>>>>>> I'm open to suggestions.  I'm also happy to try to prototype
> > > >>>> something and
> > > >>>>>>> get feedback once there is concrete code to evaluate.
> > > >>>>>>>
> > > >>>>>>> I don't understand the limitation to different types, so +1 for
> > > >>>>>>>> generalized unions.  That said, I don't think it's high-priority
> > > >>>> either.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Antoine, the fact that it isn't high-priority probably  is why it
> > > >>>> has taken
> > > >>>>>>> so long to resolve.  I'm excited to get to more interesting
> > higher
> > > >>>> priority
> > > >>>>>>> work, but I would like to see some of the basics finished off
> > > >>>> first.  BTW,
> > > >>>>>>> if you have suggestions on priorities, I'd be happy to hear them.
> > > >>>> After
> > > >>>>>>> this, handling empty record batches, and the UBSan work I'm
> > doing,
> > > >>> I
> > > >>>> was
> > > >>>>>>> thinking of either trying to get Avro support in, or work on fit
> > > >>> and
> > > >>>> finish
> > > >>>>>>> items for the C++/Python CSV reader.
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>> Micah
> > > >>>>>>>
> > > >>>>>>> [1]
> > > >>> https://github.com/apache/arrow/pull/987#issuecomment-493231493
> > > >>>>>>> [2]
> > > >>>>>>>
> > > >>>>>>>
> > > >>> https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a034967
> > > >>>> 5e488caf11/java/vector/src/main/java/org/apache/arrow/
> > > >>>> vector/types/pojo/Field.java#L104
> > > >>>>>>>
> > > >>>>>>> On Fri, May 24, 2019 at 2:08 AM Antoine Pitrou <
> > anto...@python.org
> > > >>>>
> > > >>>> wrote:
> > > >>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> I don't understand the limitation to different types, so +1 for
> > > >>>>>>>> generalized unions.  That said, I don't think it's high-priority
> > > >>>> either.
> > > >>>>>>>>
> > > >>>>>>>> Regards
> > > >>>>>>>>
> > > >>>>>>>> Antoine.
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> Le 24/05/2019 à 04:17, Micah Kornfield a écrit :
> > > >>>>>>>>> I'd like to bump this thread, to see if anyone has any
> > > >>>> comments.  If
> > > >>>>>>>> nobody
> > > >>>>>>>>> objects I will try to start implementing the changes next week.
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks,
> > > >>>>>>>>> Micah
> > > >>>>>>>>>
> > > >>>>>>>>> On Mon, May 20, 2019 at 9:37 PM Micah Kornfield <
> > > >>>> emkornfi...@gmail.com
> > > >>>>>>>>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> In the past [1] there hasn't been agreement on the final
> > > >>>> requirements
> > > >>>>>>>> for
> > > >>>>>>>>>> union types.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Briefly the two approaches that are currently advocated:
> > > >>>>>>>>>> 1.  Limit unions to only contain one field of each individual
> > > >>>> type
> > > >>>>>>> (e.g.
> > > >>>>>>>>>> you can't have two separate int32 fields).  Java takes this
> > > >>>> approach.
> > > >>>>>>>>>> 2.  Generalized unions (unions can have any number of fields
> > > >>>> with the
> > > >>>>>>>> same
> > > >>>>>>>>>> type).  C++ takes this approach.
> > > >>>>>>>>>>
> > > >>>>>>>>>> There was a prior PR [2] that stalled in trying to take this
> > > >>>> approach
> > > >>>>>>>> with
> > > >>>>>>>>>> Java.  For writing vectors it seemed to be slower on a
> > > >>>> benchmark.
> > > >>>>>>>>>>
> > > >>>>>>>>>> My proposal:  We should pursue option 2 (the general
> > > >>>> approach).  There
> > > >>>>>>>> are
> > > >>>>>>>>>> already data interchange formats that support it and it would
> > > >>>> be nice
> > > >>>>>>>> to a
> > > >>>>>>>>>> data-model that lets us make the translation between Arrow
> > > >>>> schemas
> > > >>>>>>> easy:
> > > >>>>>>>>>> 1.  Avro Seems to support it [3] (with the exception of
> > > >>> complex
> > > >>>> types)
> > > >>>>>>>>>> 2.  Protobufs loosely support it [4] via one-of.
> > > >>>>>>>>>>
> > > >>>>>>>>>> In order to address issues in [2], I propose the following
> > > >>>> making the
> > > >>>>>>>>>> changes/additions to the Java implementation:
> > > >>>>>>>>>> 1.  Keep the default write-path untouched with the existing
> > > >>>> class.
> > > >>>>>>>>>> 2.  Add in a new sparse union class that implements the same
> > > >>>> interface
> > > >>>>>>>>>> that can be used on the read path, and if a client opts in
> > > >>> (via
> > > >>>> direct
> > > >>>>>>>>>> construction).
> > > >>>>>>>>>> 3.  Add in a dense union class (I don't believe Java has one).
> > > >>>>>>>>>>
> > > >>>>>>>>>> I'm still ramping up the Java code base, so I'd like other
> > > >>> Java
> > > >>>>>>>>>> contributors to chime in to see if this plan sounds feasible
> > > >>> and
> > > >>>>>>>> acceptable.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Any other thoughts on Unions?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks,
> > > >>>>>>>>>> Micah
> > > >>>>>>>>>>
> > > >>>>>>>>>> [1]
> > > >>>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>> https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee
> > > >>>> 9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E
> > > >>>>>>>>>> [2] https://github.com/apache/arrow/pull/987#issuecomment-
> > > >>>> 493231493
> > > >>>>>>>>>> [3] https://github.com/apache/arrow/pull/987#issuecomment-
> > > >>>> 493231493
> > > >>>>>>>>>> [4]
> > > >>> https://developers.google.com/protocol-buffers/docs/proto#
> > > >>>> oneof
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> --
> > > >>>>>> Thanks and regards,
> > > >>>>>> Ravindra.
> > > >>>>
> > > >>>
> >

Reply via email to