example sparse union:

types: (int64, utf8)
type_ids: [0, 4]

type buffer: [0, 0, 0, 4, 4, 4]

child 0: [1, 2, 3, --, --, --]
child 1: [--, --, --, 'foo', 'bar', 'baz']

example dense union:

types: (int64, utf8)
type_ids: [0, 4]

type buffer: [0, 0, 0, 4, 4, 4]
offsets buffer: [0, 1, 2, 0, 1, 2]

child 0: [1, 2, 3]
child 1: ['foo', 'bar', 'baz']

On Mon, Jun 17, 2019 at 3:50 PM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Le 17/06/2019 à 22:46, Wes McKinney a écrit :
> > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L88
> >
> > "optionally typeIds provides an indirection between the child offset
> > and the type id for each child typeIds[offset] is the id used in the
> > type vector"
>
> Does this mean typeIds[logical_id] gives the physical child index?
> or typeIds[physical_index] gives the logical type id?
>
> I don't understand what "offset" means in this context, usually it's a
> displacement from the beginning of an array.
>
>
> >
> > On Mon, Jun 17, 2019 at 12:26 PM Ben Kietzman <ben.kietz...@rstudio.com> 
> > wrote:
> >>
> >> Somewhat related:
> >>
> >> Could we clarify the expected content of the type_ids buffer of union
> >> arrays? Layout.rst
> >> <https://github.com/apache/arrow/blob/dede1e6/docs/source/format/Layout.rst#dense-union-type>
> >> seems to indicate these should be indices of the corresponding child array,
> >> but the C++ implementation allows them to be any positive int8 and
> >> maintains a mapping to child indices. (for example see what is generated
> >> for integration testing
> >> <https://github.com/apache/arrow/blob/dede1e6/cpp/src/arrow/ipc/test-common.cc#L410>:
> >> has two child arrays and type_ids 5, 10)
> >>
> >> On Mon, Jun 17, 2019 at 11:35 AM Micah Kornfield <emkornfi...@gmail.com>
> >> wrote:
> >>
> >>> Sounds good.  Sorry I got distracted with some other stuff but should be
> >>> getting back to this soonish
> >>>
> >>> On Monday, June 17, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
> >>>
> >>>> I'd already moved the Union issues to 1.0.0 so we are all good there
> >>>>
> >>>> On Mon, Jun 17, 2019 at 10:18 AM Wes McKinney <wesmck...@gmail.com>
> >>> wrote:
> >>>>>
> >>>>> I'm also +1 for generalized unions as we currently have specified. The
> >>>>> objections from the Java users seems to be mostly on the basis of
> >>>>> performance in the union-of-primitives case -- that's an
> >>>>> implementation specific issue, so if Java needs to have a
> >>>>> "GeneralizedDenseUnionVector" or something to handle the
> >>>>> union-of-anything case, then that seems reasonable. The important
> >>>>> thing is that the binary protocol itself and serialized metadata is
> >>>>> something that we are happy with and won't need to change going
> >>>>> forward.
> >>>>>
> >>>>> It seems we're getting a bit long in the tooth to get this into 0.14.0
> >>>>> so I'm going to move the Union-related issues to the 1.0.0 milestone
> >>>>> so we can get this resolved for the 1.0.0 release
> >>>>>
> >>>>> Thanks
> >>>>> Wes
> >>>>>
> >>>>> On Mon, Jun 10, 2019 at 12:33 AM Ravindra Pindikura <
> >>> ravin...@dremio.com>
> >>>> wrote:
> >>>>>>
> >>>>>> On Sat, May 25, 2019 at 12:29 PM Micah Kornfield <
> >>>> emkornfi...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Thanks for the responses, I've clipped the questions and provided
> >>>> responses
> >>>>>>> inline.
> >>>>>>>
> >>>>>>> is the proposal that both cpp & java will support only option 2 ?
> >>>>>>>> I guess 1 is a subset of 2 anyway.
> >>>>>>>
> >>>>>>> CPP already supports option 2.  I would like to make CPP and java
> >>>>>>> compatible, in a way that this acceptable for Java maintainers.
> >>>> Yes, 1 is
> >>>>>>> a subset of 2.
> >>>>>>>
> >>>>>>> The metadata on java side uses the minor type id as the type
> >>>> identifier in
> >>>>>>>> the union (and the field name is expected to the the same as the
> >>>> type
> >>>>>>> name
> >>>>>>>> in a union). If you were to support a generalized union, this
> >>>> wouldn't
> >>>>>>>> work. How will the type identifiers be generated ?
> >>>>>>>> I'm trying to see if we can make the change backward compatible,
> >>>> with
> >>>>>>>> existing unions in java.
> >>>>>>>
> >>>>>>>
> >>>>>>> Looking at the code, I don't think the existing Union class is
> >>>>>>> generalizable because of this assumption (it caches a single type
> >>> of
> >>>> each
> >>>>>>> locally) and based on the Javadoc this seems to be for performance
> >>>> reasons,
> >>>>>>> so I would like to try to avoid touching it if possible.
> >>>>>>>
> >>>>>>
> >>>>>> @Micah Kornfield <emkornfi...@gmail.com> sorry, I wasn't clear. I
> >>>> meant to
> >>>>>> ask if the format would be backward compatible, which I think it will
> >>>> be
> >>>>>> (since 1 is a subset of 2, and your proposal isn't making a change in
> >>>> the
> >>>>>> wire format).
> >>>>>>
> >>>>>> I'm fine if the APIs are not backward compatible. Or, once we have 2,
> >>>> we
> >>>>>> can add wrappers for 1, if required.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> My original thinking was to try to factor out a common base class
> >>>> from
> >>>>>>> UnionVector, then create a new GeneralizedUnionVector class that
> >>> has
> >>>>>>> slightly different method signatures (its possible I will need
> >>>> additional
> >>>>>>> supporting classes like a new GeneralizedUnionWriter, but i haven't
> >>>> gotten
> >>>>>>> that far yet).  The main challenge I see is a way to let users
> >>> switch
> >>>>>>> between the two implementations.  Some thoughts off the top of my
> >>>> head
> >>>>>>> (none of them seem good):
> >>>>>>> 1.  Create a flag like:
> >>>>>>>
> >>>>>>>
> >>> https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2e
> >>>> f465d331fc/java/memory/src/main/java/org/apache/arrow/
> >>>> memory/BoundsChecking.java
> >>>>>>> so
> >>>>>>> it is statically decided before hand, and have the new class
> >>>> implement the
> >>>>>>> same signatures as UnionVector to but throw an exception if a
> >>> method
> >>>> that
> >>>>>>> isn't compatible is called.
> >>>>>>> 2. Possibly try to augment ArrowType to pass through information
> >>>> about its
> >>>>>>> children vectors when reading vectors, but use the flag in option 1
> >>>> if it
> >>>>>>> can't be determined.
> >>>>>>>
> >>>>>>> I'm open to suggestions.  I'm also happy to try to prototype
> >>>> something and
> >>>>>>> get feedback once there is concrete code to evaluate.
> >>>>>>>
> >>>>>>> I don't understand the limitation to different types, so +1 for
> >>>>>>>> generalized unions.  That said, I don't think it's high-priority
> >>>> either.
> >>>>>>>
> >>>>>>>
> >>>>>>> Antoine, the fact that it isn't high-priority probably  is why it
> >>>> has taken
> >>>>>>> so long to resolve.  I'm excited to get to more interesting higher
> >>>> priority
> >>>>>>> work, but I would like to see some of the basics finished off
> >>>> first.  BTW,
> >>>>>>> if you have suggestions on priorities, I'd be happy to hear them.
> >>>> After
> >>>>>>> this, handling empty record batches, and the UBSan work I'm doing,
> >>> I
> >>>> was
> >>>>>>> thinking of either trying to get Avro support in, or work on fit
> >>> and
> >>>> finish
> >>>>>>> items for the C++/Python CSV reader.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Micah
> >>>>>>>
> >>>>>>> [1]
> >>> https://github.com/apache/arrow/pull/987#issuecomment-493231493
> >>>>>>> [2]
> >>>>>>>
> >>>>>>>
> >>> https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a034967
> >>>> 5e488caf11/java/vector/src/main/java/org/apache/arrow/
> >>>> vector/types/pojo/Field.java#L104
> >>>>>>>
> >>>>>>> On Fri, May 24, 2019 at 2:08 AM Antoine Pitrou <anto...@python.org
> >>>>
> >>>> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> I don't understand the limitation to different types, so +1 for
> >>>>>>>> generalized unions.  That said, I don't think it's high-priority
> >>>> either.
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>>
> >>>>>>>> Antoine.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Le 24/05/2019 à 04:17, Micah Kornfield a écrit :
> >>>>>>>>> I'd like to bump this thread, to see if anyone has any
> >>>> comments.  If
> >>>>>>>> nobody
> >>>>>>>>> objects I will try to start implementing the changes next week.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Micah
> >>>>>>>>>
> >>>>>>>>> On Mon, May 20, 2019 at 9:37 PM Micah Kornfield <
> >>>> emkornfi...@gmail.com
> >>>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> In the past [1] there hasn't been agreement on the final
> >>>> requirements
> >>>>>>>> for
> >>>>>>>>>> union types.
> >>>>>>>>>>
> >>>>>>>>>> Briefly the two approaches that are currently advocated:
> >>>>>>>>>> 1.  Limit unions to only contain one field of each individual
> >>>> type
> >>>>>>> (e.g.
> >>>>>>>>>> you can't have two separate int32 fields).  Java takes this
> >>>> approach.
> >>>>>>>>>> 2.  Generalized unions (unions can have any number of fields
> >>>> with the
> >>>>>>>> same
> >>>>>>>>>> type).  C++ takes this approach.
> >>>>>>>>>>
> >>>>>>>>>> There was a prior PR [2] that stalled in trying to take this
> >>>> approach
> >>>>>>>> with
> >>>>>>>>>> Java.  For writing vectors it seemed to be slower on a
> >>>> benchmark.
> >>>>>>>>>>
> >>>>>>>>>> My proposal:  We should pursue option 2 (the general
> >>>> approach).  There
> >>>>>>>> are
> >>>>>>>>>> already data interchange formats that support it and it would
> >>>> be nice
> >>>>>>>> to a
> >>>>>>>>>> data-model that lets us make the translation between Arrow
> >>>> schemas
> >>>>>>> easy:
> >>>>>>>>>> 1.  Avro Seems to support it [3] (with the exception of
> >>> complex
> >>>> types)
> >>>>>>>>>> 2.  Protobufs loosely support it [4] via one-of.
> >>>>>>>>>>
> >>>>>>>>>> In order to address issues in [2], I propose the following
> >>>> making the
> >>>>>>>>>> changes/additions to the Java implementation:
> >>>>>>>>>> 1.  Keep the default write-path untouched with the existing
> >>>> class.
> >>>>>>>>>> 2.  Add in a new sparse union class that implements the same
> >>>> interface
> >>>>>>>>>> that can be used on the read path, and if a client opts in
> >>> (via
> >>>> direct
> >>>>>>>>>> construction).
> >>>>>>>>>> 3.  Add in a dense union class (I don't believe Java has one).
> >>>>>>>>>>
> >>>>>>>>>> I'm still ramping up the Java code base, so I'd like other
> >>> Java
> >>>>>>>>>> contributors to chime in to see if this plan sounds feasible
> >>> and
> >>>>>>>> acceptable.
> >>>>>>>>>>
> >>>>>>>>>> Any other thoughts on Unions?
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Micah
> >>>>>>>>>>
> >>>>>>>>>> [1]
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>> https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee
> >>>> 9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E
> >>>>>>>>>> [2] https://github.com/apache/arrow/pull/987#issuecomment-
> >>>> 493231493
> >>>>>>>>>> [3] https://github.com/apache/arrow/pull/987#issuecomment-
> >>>> 493231493
> >>>>>>>>>> [4]
> >>> https://developers.google.com/protocol-buffers/docs/proto#
> >>>> oneof
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Thanks and regards,
> >>>>>> Ravindra.
> >>>>
> >>>

Reply via email to