I don't understand the utility of this indirection and to me it seems more natural to remove Union.typeIds and state that the type buffer will always contain the index of the corresponding child array.
One use case for Union.typeIds is simplification of dropping types from the union. However performing that operation on an array would still require scanning the types buffer for elements of the dropped types and adding them to a new null bitmap; it's not zero copy or allocation free and seems like a poor justification for the added complexity. If there is no interest in simplifying by removing this then I'll just create a JIRA to update the doccomments On Mon, Jun 17, 2019 at 5:20 PM Wes McKinney <wesmck...@gmail.com> wrote: > example sparse union: > > types: (int64, utf8) > type_ids: [0, 4] > > type buffer: [0, 0, 0, 4, 4, 4] > > child 0: [1, 2, 3, --, --, --] > child 1: [--, --, --, 'foo', 'bar', 'baz'] > > example dense union: > > types: (int64, utf8) > type_ids: [0, 4] > > type buffer: [0, 0, 0, 4, 4, 4] > offsets buffer: [0, 1, 2, 0, 1, 2] > > child 0: [1, 2, 3] > child 1: ['foo', 'bar', 'baz'] > > On Mon, Jun 17, 2019 at 3:50 PM Antoine Pitrou <anto...@python.org> wrote: > > > > > > Le 17/06/2019 à 22:46, Wes McKinney a écrit : > > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L88 > > > > > > "optionally typeIds provides an indirection between the child offset > > > and the type id for each child typeIds[offset] is the id used in the > > > type vector" > > > > Does this mean typeIds[logical_id] gives the physical child index? > > or typeIds[physical_index] gives the logical type id? > > > > I don't understand what "offset" means in this context, usually it's a > > displacement from the beginning of an array. > > > > > > > > > > On Mon, Jun 17, 2019 at 12:26 PM Ben Kietzman < > ben.kietz...@rstudio.com> wrote: > > >> > > >> Somewhat related: > > >> > > >> Could we clarify the expected content of the type_ids buffer of union > > >> arrays? Layout.rst > > >> < > https://github.com/apache/arrow/blob/dede1e6/docs/source/format/Layout.rst#dense-union-type > > > > >> seems to indicate these should be indices of the corresponding child > array, > > >> but the C++ implementation allows them to be any positive int8 and > > >> maintains a mapping to child indices. (for example see what is > generated > > >> for integration testing > > >> < > https://github.com/apache/arrow/blob/dede1e6/cpp/src/arrow/ipc/test-common.cc#L410 > >: > > >> has two child arrays and type_ids 5, 10) > > >> > > >> On Mon, Jun 17, 2019 at 11:35 AM Micah Kornfield < > emkornfi...@gmail.com> > > >> wrote: > > >> > > >>> Sounds good. Sorry I got distracted with some other stuff but > should be > > >>> getting back to this soonish > > >>> > > >>> On Monday, June 17, 2019, Wes McKinney <wesmck...@gmail.com> wrote: > > >>> > > >>>> I'd already moved the Union issues to 1.0.0 so we are all good there > > >>>> > > >>>> On Mon, Jun 17, 2019 at 10:18 AM Wes McKinney <wesmck...@gmail.com> > > >>> wrote: > > >>>>> > > >>>>> I'm also +1 for generalized unions as we currently have specified. > The > > >>>>> objections from the Java users seems to be mostly on the basis of > > >>>>> performance in the union-of-primitives case -- that's an > > >>>>> implementation specific issue, so if Java needs to have a > > >>>>> "GeneralizedDenseUnionVector" or something to handle the > > >>>>> union-of-anything case, then that seems reasonable. The important > > >>>>> thing is that the binary protocol itself and serialized metadata is > > >>>>> something that we are happy with and won't need to change going > > >>>>> forward. > > >>>>> > > >>>>> It seems we're getting a bit long in the tooth to get this into > 0.14.0 > > >>>>> so I'm going to move the Union-related issues to the 1.0.0 > milestone > > >>>>> so we can get this resolved for the 1.0.0 release > > >>>>> > > >>>>> Thanks > > >>>>> Wes > > >>>>> > > >>>>> On Mon, Jun 10, 2019 at 12:33 AM Ravindra Pindikura < > > >>> ravin...@dremio.com> > > >>>> wrote: > > >>>>>> > > >>>>>> On Sat, May 25, 2019 at 12:29 PM Micah Kornfield < > > >>>> emkornfi...@gmail.com> > > >>>>>> wrote: > > >>>>>> > > >>>>>>> Thanks for the responses, I've clipped the questions and provided > > >>>> responses > > >>>>>>> inline. > > >>>>>>> > > >>>>>>> is the proposal that both cpp & java will support only option 2 ? > > >>>>>>>> I guess 1 is a subset of 2 anyway. > > >>>>>>> > > >>>>>>> CPP already supports option 2. I would like to make CPP and java > > >>>>>>> compatible, in a way that this acceptable for Java maintainers. > > >>>> Yes, 1 is > > >>>>>>> a subset of 2. > > >>>>>>> > > >>>>>>> The metadata on java side uses the minor type id as the type > > >>>> identifier in > > >>>>>>>> the union (and the field name is expected to the the same as the > > >>>> type > > >>>>>>> name > > >>>>>>>> in a union). If you were to support a generalized union, this > > >>>> wouldn't > > >>>>>>>> work. How will the type identifiers be generated ? > > >>>>>>>> I'm trying to see if we can make the change backward compatible, > > >>>> with > > >>>>>>>> existing unions in java. > > >>>>>>> > > >>>>>>> > > >>>>>>> Looking at the code, I don't think the existing Union class is > > >>>>>>> generalizable because of this assumption (it caches a single type > > >>> of > > >>>> each > > >>>>>>> locally) and based on the Javadoc this seems to be for > performance > > >>>> reasons, > > >>>>>>> so I would like to try to avoid touching it if possible. > > >>>>>>> > > >>>>>> > > >>>>>> @Micah Kornfield <emkornfi...@gmail.com> sorry, I wasn't clear. I > > >>>> meant to > > >>>>>> ask if the format would be backward compatible, which I think it > will > > >>>> be > > >>>>>> (since 1 is a subset of 2, and your proposal isn't making a > change in > > >>>> the > > >>>>>> wire format). > > >>>>>> > > >>>>>> I'm fine if the APIs are not backward compatible. Or, once we > have 2, > > >>>> we > > >>>>>> can add wrappers for 1, if required. > > >>>>>> > > >>>>>> > > >>>>>>> > > >>>>>>> My original thinking was to try to factor out a common base class > > >>>> from > > >>>>>>> UnionVector, then create a new GeneralizedUnionVector class that > > >>> has > > >>>>>>> slightly different method signatures (its possible I will need > > >>>> additional > > >>>>>>> supporting classes like a new GeneralizedUnionWriter, but i > haven't > > >>>> gotten > > >>>>>>> that far yet). The main challenge I see is a way to let users > > >>> switch > > >>>>>>> between the two implementations. Some thoughts off the top of my > > >>>> head > > >>>>>>> (none of them seem good): > > >>>>>>> 1. Create a flag like: > > >>>>>>> > > >>>>>>> > > >>> https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2e > > >>>> f465d331fc/java/memory/src/main/java/org/apache/arrow/ > > >>>> memory/BoundsChecking.java > > >>>>>>> so > > >>>>>>> it is statically decided before hand, and have the new class > > >>>> implement the > > >>>>>>> same signatures as UnionVector to but throw an exception if a > > >>> method > > >>>> that > > >>>>>>> isn't compatible is called. > > >>>>>>> 2. Possibly try to augment ArrowType to pass through information > > >>>> about its > > >>>>>>> children vectors when reading vectors, but use the flag in > option 1 > > >>>> if it > > >>>>>>> can't be determined. > > >>>>>>> > > >>>>>>> I'm open to suggestions. I'm also happy to try to prototype > > >>>> something and > > >>>>>>> get feedback once there is concrete code to evaluate. > > >>>>>>> > > >>>>>>> I don't understand the limitation to different types, so +1 for > > >>>>>>>> generalized unions. That said, I don't think it's high-priority > > >>>> either. > > >>>>>>> > > >>>>>>> > > >>>>>>> Antoine, the fact that it isn't high-priority probably is why it > > >>>> has taken > > >>>>>>> so long to resolve. I'm excited to get to more interesting > higher > > >>>> priority > > >>>>>>> work, but I would like to see some of the basics finished off > > >>>> first. BTW, > > >>>>>>> if you have suggestions on priorities, I'd be happy to hear them. > > >>>> After > > >>>>>>> this, handling empty record batches, and the UBSan work I'm > doing, > > >>> I > > >>>> was > > >>>>>>> thinking of either trying to get Avro support in, or work on fit > > >>> and > > >>>> finish > > >>>>>>> items for the C++/Python CSV reader. > > >>>>>>> > > >>>>>>> Thanks, > > >>>>>>> Micah > > >>>>>>> > > >>>>>>> [1] > > >>> https://github.com/apache/arrow/pull/987#issuecomment-493231493 > > >>>>>>> [2] > > >>>>>>> > > >>>>>>> > > >>> https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a034967 > > >>>> 5e488caf11/java/vector/src/main/java/org/apache/arrow/ > > >>>> vector/types/pojo/Field.java#L104 > > >>>>>>> > > >>>>>>> On Fri, May 24, 2019 at 2:08 AM Antoine Pitrou < > anto...@python.org > > >>>> > > >>>> wrote: > > >>>>>>> > > >>>>>>>> > > >>>>>>>> I don't understand the limitation to different types, so +1 for > > >>>>>>>> generalized unions. That said, I don't think it's high-priority > > >>>> either. > > >>>>>>>> > > >>>>>>>> Regards > > >>>>>>>> > > >>>>>>>> Antoine. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> Le 24/05/2019 à 04:17, Micah Kornfield a écrit : > > >>>>>>>>> I'd like to bump this thread, to see if anyone has any > > >>>> comments. If > > >>>>>>>> nobody > > >>>>>>>>> objects I will try to start implementing the changes next week. > > >>>>>>>>> > > >>>>>>>>> Thanks, > > >>>>>>>>> Micah > > >>>>>>>>> > > >>>>>>>>> On Mon, May 20, 2019 at 9:37 PM Micah Kornfield < > > >>>> emkornfi...@gmail.com > > >>>>>>>> > > >>>>>>>>> wrote: > > >>>>>>>>> > > >>>>>>>>>> In the past [1] there hasn't been agreement on the final > > >>>> requirements > > >>>>>>>> for > > >>>>>>>>>> union types. > > >>>>>>>>>> > > >>>>>>>>>> Briefly the two approaches that are currently advocated: > > >>>>>>>>>> 1. Limit unions to only contain one field of each individual > > >>>> type > > >>>>>>> (e.g. > > >>>>>>>>>> you can't have two separate int32 fields). Java takes this > > >>>> approach. > > >>>>>>>>>> 2. Generalized unions (unions can have any number of fields > > >>>> with the > > >>>>>>>> same > > >>>>>>>>>> type). C++ takes this approach. > > >>>>>>>>>> > > >>>>>>>>>> There was a prior PR [2] that stalled in trying to take this > > >>>> approach > > >>>>>>>> with > > >>>>>>>>>> Java. For writing vectors it seemed to be slower on a > > >>>> benchmark. > > >>>>>>>>>> > > >>>>>>>>>> My proposal: We should pursue option 2 (the general > > >>>> approach). There > > >>>>>>>> are > > >>>>>>>>>> already data interchange formats that support it and it would > > >>>> be nice > > >>>>>>>> to a > > >>>>>>>>>> data-model that lets us make the translation between Arrow > > >>>> schemas > > >>>>>>> easy: > > >>>>>>>>>> 1. Avro Seems to support it [3] (with the exception of > > >>> complex > > >>>> types) > > >>>>>>>>>> 2. Protobufs loosely support it [4] via one-of. > > >>>>>>>>>> > > >>>>>>>>>> In order to address issues in [2], I propose the following > > >>>> making the > > >>>>>>>>>> changes/additions to the Java implementation: > > >>>>>>>>>> 1. Keep the default write-path untouched with the existing > > >>>> class. > > >>>>>>>>>> 2. Add in a new sparse union class that implements the same > > >>>> interface > > >>>>>>>>>> that can be used on the read path, and if a client opts in > > >>> (via > > >>>> direct > > >>>>>>>>>> construction). > > >>>>>>>>>> 3. Add in a dense union class (I don't believe Java has one). > > >>>>>>>>>> > > >>>>>>>>>> I'm still ramping up the Java code base, so I'd like other > > >>> Java > > >>>>>>>>>> contributors to chime in to see if this plan sounds feasible > > >>> and > > >>>>>>>> acceptable. > > >>>>>>>>>> > > >>>>>>>>>> Any other thoughts on Unions? > > >>>>>>>>>> > > >>>>>>>>>> Thanks, > > >>>>>>>>>> Micah > > >>>>>>>>>> > > >>>>>>>>>> [1] > > >>>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>> https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee > > >>>> 9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E > > >>>>>>>>>> [2] https://github.com/apache/arrow/pull/987#issuecomment- > > >>>> 493231493 > > >>>>>>>>>> [3] https://github.com/apache/arrow/pull/987#issuecomment- > > >>>> 493231493 > > >>>>>>>>>> [4] > > >>> https://developers.google.com/protocol-buffers/docs/proto# > > >>>> oneof > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>>> > > >>>>>> -- > > >>>>>> Thanks and regards, > > >>>>>> Ravindra. > > >>>> > > >>> >