On Tue, Jun 18, 2019 at 6:38 AM Ben Kietzman <ben.kietz...@rstudio.com> wrote: > > I don't understand the utility of this indirection and to me it seems more > natural to remove Union.typeIds and state that the type buffer will always > contain the index of the corresponding child array. > > One use case for Union.typeIds is simplification of dropping types from the > union. However performing that operation on an array would still require > scanning the types buffer for elements of the dropped types and adding them > to a new null bitmap; it's not zero copy or allocation free and seems like > a poor justification for the added complexity. >
I can see the argument for removing (well, deprecating since removing fields in Flatbuffers is bad) the typeIds metadata field, but it creates a hard requirement that the union types are globally known to all data producers. Otherwise a "unification" step can be required to normalize the type ids. The intent of the type ids was to allow for scenarios where unions can be combined or split without a data-rewrite step; it's possible that these scenarios are too esoteric to burden the average case, but unions are already a fairly specialized tool. In any case, I'm +0 on taking any action, if there are others who have strong opinions about it, we can have a dedicated discussion thread about this issue in particular. > If there is no interest in simplifying by removing this then I'll just > create a JIRA to update the doccomments > > On Mon, Jun 17, 2019 at 5:20 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > example sparse union: > > > > types: (int64, utf8) > > type_ids: [0, 4] > > > > type buffer: [0, 0, 0, 4, 4, 4] > > > > child 0: [1, 2, 3, --, --, --] > > child 1: [--, --, --, 'foo', 'bar', 'baz'] > > > > example dense union: > > > > types: (int64, utf8) > > type_ids: [0, 4] > > > > type buffer: [0, 0, 0, 4, 4, 4] > > offsets buffer: [0, 1, 2, 0, 1, 2] > > > > child 0: [1, 2, 3] > > child 1: ['foo', 'bar', 'baz'] > > > > On Mon, Jun 17, 2019 at 3:50 PM Antoine Pitrou <anto...@python.org> wrote: > > > > > > > > > Le 17/06/2019 à 22:46, Wes McKinney a écrit : > > > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L88 > > > > > > > > "optionally typeIds provides an indirection between the child offset > > > > and the type id for each child typeIds[offset] is the id used in the > > > > type vector" > > > > > > Does this mean typeIds[logical_id] gives the physical child index? > > > or typeIds[physical_index] gives the logical type id? > > > > > > I don't understand what "offset" means in this context, usually it's a > > > displacement from the beginning of an array. > > > > > > > > > > > > > > On Mon, Jun 17, 2019 at 12:26 PM Ben Kietzman < > > ben.kietz...@rstudio.com> wrote: > > > >> > > > >> Somewhat related: > > > >> > > > >> Could we clarify the expected content of the type_ids buffer of union > > > >> arrays? Layout.rst > > > >> < > > https://github.com/apache/arrow/blob/dede1e6/docs/source/format/Layout.rst#dense-union-type > > > > > > >> seems to indicate these should be indices of the corresponding child > > array, > > > >> but the C++ implementation allows them to be any positive int8 and > > > >> maintains a mapping to child indices. (for example see what is > > generated > > > >> for integration testing > > > >> < > > https://github.com/apache/arrow/blob/dede1e6/cpp/src/arrow/ipc/test-common.cc#L410 > > >: > > > >> has two child arrays and type_ids 5, 10) > > > >> > > > >> On Mon, Jun 17, 2019 at 11:35 AM Micah Kornfield < > > emkornfi...@gmail.com> > > > >> wrote: > > > >> > > > >>> Sounds good. Sorry I got distracted with some other stuff but > > should be > > > >>> getting back to this soonish > > > >>> > > > >>> On Monday, June 17, 2019, Wes McKinney <wesmck...@gmail.com> wrote: > > > >>> > > > >>>> I'd already moved the Union issues to 1.0.0 so we are all good there > > > >>>> > > > >>>> On Mon, Jun 17, 2019 at 10:18 AM Wes McKinney <wesmck...@gmail.com> > > > >>> wrote: > > > >>>>> > > > >>>>> I'm also +1 for generalized unions as we currently have specified. > > The > > > >>>>> objections from the Java users seems to be mostly on the basis of > > > >>>>> performance in the union-of-primitives case -- that's an > > > >>>>> implementation specific issue, so if Java needs to have a > > > >>>>> "GeneralizedDenseUnionVector" or something to handle the > > > >>>>> union-of-anything case, then that seems reasonable. The important > > > >>>>> thing is that the binary protocol itself and serialized metadata is > > > >>>>> something that we are happy with and won't need to change going > > > >>>>> forward. > > > >>>>> > > > >>>>> It seems we're getting a bit long in the tooth to get this into > > 0.14.0 > > > >>>>> so I'm going to move the Union-related issues to the 1.0.0 > > milestone > > > >>>>> so we can get this resolved for the 1.0.0 release > > > >>>>> > > > >>>>> Thanks > > > >>>>> Wes > > > >>>>> > > > >>>>> On Mon, Jun 10, 2019 at 12:33 AM Ravindra Pindikura < > > > >>> ravin...@dremio.com> > > > >>>> wrote: > > > >>>>>> > > > >>>>>> On Sat, May 25, 2019 at 12:29 PM Micah Kornfield < > > > >>>> emkornfi...@gmail.com> > > > >>>>>> wrote: > > > >>>>>> > > > >>>>>>> Thanks for the responses, I've clipped the questions and provided > > > >>>> responses > > > >>>>>>> inline. > > > >>>>>>> > > > >>>>>>> is the proposal that both cpp & java will support only option 2 ? > > > >>>>>>>> I guess 1 is a subset of 2 anyway. > > > >>>>>>> > > > >>>>>>> CPP already supports option 2. I would like to make CPP and java > > > >>>>>>> compatible, in a way that this acceptable for Java maintainers. > > > >>>> Yes, 1 is > > > >>>>>>> a subset of 2. > > > >>>>>>> > > > >>>>>>> The metadata on java side uses the minor type id as the type > > > >>>> identifier in > > > >>>>>>>> the union (and the field name is expected to the the same as the > > > >>>> type > > > >>>>>>> name > > > >>>>>>>> in a union). If you were to support a generalized union, this > > > >>>> wouldn't > > > >>>>>>>> work. How will the type identifiers be generated ? > > > >>>>>>>> I'm trying to see if we can make the change backward compatible, > > > >>>> with > > > >>>>>>>> existing unions in java. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> Looking at the code, I don't think the existing Union class is > > > >>>>>>> generalizable because of this assumption (it caches a single type > > > >>> of > > > >>>> each > > > >>>>>>> locally) and based on the Javadoc this seems to be for > > performance > > > >>>> reasons, > > > >>>>>>> so I would like to try to avoid touching it if possible. > > > >>>>>>> > > > >>>>>> > > > >>>>>> @Micah Kornfield <emkornfi...@gmail.com> sorry, I wasn't clear. I > > > >>>> meant to > > > >>>>>> ask if the format would be backward compatible, which I think it > > will > > > >>>> be > > > >>>>>> (since 1 is a subset of 2, and your proposal isn't making a > > change in > > > >>>> the > > > >>>>>> wire format). > > > >>>>>> > > > >>>>>> I'm fine if the APIs are not backward compatible. Or, once we > > have 2, > > > >>>> we > > > >>>>>> can add wrappers for 1, if required. > > > >>>>>> > > > >>>>>> > > > >>>>>>> > > > >>>>>>> My original thinking was to try to factor out a common base class > > > >>>> from > > > >>>>>>> UnionVector, then create a new GeneralizedUnionVector class that > > > >>> has > > > >>>>>>> slightly different method signatures (its possible I will need > > > >>>> additional > > > >>>>>>> supporting classes like a new GeneralizedUnionWriter, but i > > haven't > > > >>>> gotten > > > >>>>>>> that far yet). The main challenge I see is a way to let users > > > >>> switch > > > >>>>>>> between the two implementations. Some thoughts off the top of my > > > >>>> head > > > >>>>>>> (none of them seem good): > > > >>>>>>> 1. Create a flag like: > > > >>>>>>> > > > >>>>>>> > > > >>> https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2e > > > >>>> f465d331fc/java/memory/src/main/java/org/apache/arrow/ > > > >>>> memory/BoundsChecking.java > > > >>>>>>> so > > > >>>>>>> it is statically decided before hand, and have the new class > > > >>>> implement the > > > >>>>>>> same signatures as UnionVector to but throw an exception if a > > > >>> method > > > >>>> that > > > >>>>>>> isn't compatible is called. > > > >>>>>>> 2. Possibly try to augment ArrowType to pass through information > > > >>>> about its > > > >>>>>>> children vectors when reading vectors, but use the flag in > > option 1 > > > >>>> if it > > > >>>>>>> can't be determined. > > > >>>>>>> > > > >>>>>>> I'm open to suggestions. I'm also happy to try to prototype > > > >>>> something and > > > >>>>>>> get feedback once there is concrete code to evaluate. > > > >>>>>>> > > > >>>>>>> I don't understand the limitation to different types, so +1 for > > > >>>>>>>> generalized unions. That said, I don't think it's high-priority > > > >>>> either. > > > >>>>>>> > > > >>>>>>> > > > >>>>>>> Antoine, the fact that it isn't high-priority probably is why it > > > >>>> has taken > > > >>>>>>> so long to resolve. I'm excited to get to more interesting > > higher > > > >>>> priority > > > >>>>>>> work, but I would like to see some of the basics finished off > > > >>>> first. BTW, > > > >>>>>>> if you have suggestions on priorities, I'd be happy to hear them. > > > >>>> After > > > >>>>>>> this, handling empty record batches, and the UBSan work I'm > > doing, > > > >>> I > > > >>>> was > > > >>>>>>> thinking of either trying to get Avro support in, or work on fit > > > >>> and > > > >>>> finish > > > >>>>>>> items for the C++/Python CSV reader. > > > >>>>>>> > > > >>>>>>> Thanks, > > > >>>>>>> Micah > > > >>>>>>> > > > >>>>>>> [1] > > > >>> https://github.com/apache/arrow/pull/987#issuecomment-493231493 > > > >>>>>>> [2] > > > >>>>>>> > > > >>>>>>> > > > >>> https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a034967 > > > >>>> 5e488caf11/java/vector/src/main/java/org/apache/arrow/ > > > >>>> vector/types/pojo/Field.java#L104 > > > >>>>>>> > > > >>>>>>> On Fri, May 24, 2019 at 2:08 AM Antoine Pitrou < > > anto...@python.org > > > >>>> > > > >>>> wrote: > > > >>>>>>> > > > >>>>>>>> > > > >>>>>>>> I don't understand the limitation to different types, so +1 for > > > >>>>>>>> generalized unions. That said, I don't think it's high-priority > > > >>>> either. > > > >>>>>>>> > > > >>>>>>>> Regards > > > >>>>>>>> > > > >>>>>>>> Antoine. > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> Le 24/05/2019 à 04:17, Micah Kornfield a écrit : > > > >>>>>>>>> I'd like to bump this thread, to see if anyone has any > > > >>>> comments. If > > > >>>>>>>> nobody > > > >>>>>>>>> objects I will try to start implementing the changes next week. > > > >>>>>>>>> > > > >>>>>>>>> Thanks, > > > >>>>>>>>> Micah > > > >>>>>>>>> > > > >>>>>>>>> On Mon, May 20, 2019 at 9:37 PM Micah Kornfield < > > > >>>> emkornfi...@gmail.com > > > >>>>>>>> > > > >>>>>>>>> wrote: > > > >>>>>>>>> > > > >>>>>>>>>> In the past [1] there hasn't been agreement on the final > > > >>>> requirements > > > >>>>>>>> for > > > >>>>>>>>>> union types. > > > >>>>>>>>>> > > > >>>>>>>>>> Briefly the two approaches that are currently advocated: > > > >>>>>>>>>> 1. Limit unions to only contain one field of each individual > > > >>>> type > > > >>>>>>> (e.g. > > > >>>>>>>>>> you can't have two separate int32 fields). Java takes this > > > >>>> approach. > > > >>>>>>>>>> 2. Generalized unions (unions can have any number of fields > > > >>>> with the > > > >>>>>>>> same > > > >>>>>>>>>> type). C++ takes this approach. > > > >>>>>>>>>> > > > >>>>>>>>>> There was a prior PR [2] that stalled in trying to take this > > > >>>> approach > > > >>>>>>>> with > > > >>>>>>>>>> Java. For writing vectors it seemed to be slower on a > > > >>>> benchmark. > > > >>>>>>>>>> > > > >>>>>>>>>> My proposal: We should pursue option 2 (the general > > > >>>> approach). There > > > >>>>>>>> are > > > >>>>>>>>>> already data interchange formats that support it and it would > > > >>>> be nice > > > >>>>>>>> to a > > > >>>>>>>>>> data-model that lets us make the translation between Arrow > > > >>>> schemas > > > >>>>>>> easy: > > > >>>>>>>>>> 1. Avro Seems to support it [3] (with the exception of > > > >>> complex > > > >>>> types) > > > >>>>>>>>>> 2. Protobufs loosely support it [4] via one-of. > > > >>>>>>>>>> > > > >>>>>>>>>> In order to address issues in [2], I propose the following > > > >>>> making the > > > >>>>>>>>>> changes/additions to the Java implementation: > > > >>>>>>>>>> 1. Keep the default write-path untouched with the existing > > > >>>> class. > > > >>>>>>>>>> 2. Add in a new sparse union class that implements the same > > > >>>> interface > > > >>>>>>>>>> that can be used on the read path, and if a client opts in > > > >>> (via > > > >>>> direct > > > >>>>>>>>>> construction). > > > >>>>>>>>>> 3. Add in a dense union class (I don't believe Java has one). > > > >>>>>>>>>> > > > >>>>>>>>>> I'm still ramping up the Java code base, so I'd like other > > > >>> Java > > > >>>>>>>>>> contributors to chime in to see if this plan sounds feasible > > > >>> and > > > >>>>>>>> acceptable. > > > >>>>>>>>>> > > > >>>>>>>>>> Any other thoughts on Unions? > > > >>>>>>>>>> > > > >>>>>>>>>> Thanks, > > > >>>>>>>>>> Micah > > > >>>>>>>>>> > > > >>>>>>>>>> [1] > > > >>>>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>> https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee > > > >>>> 9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E > > > >>>>>>>>>> [2] https://github.com/apache/arrow/pull/987#issuecomment- > > > >>>> 493231493 > > > >>>>>>>>>> [3] https://github.com/apache/arrow/pull/987#issuecomment- > > > >>>> 493231493 > > > >>>>>>>>>> [4] > > > >>> https://developers.google.com/protocol-buffers/docs/proto# > > > >>>> oneof > > > >>>>>>>>>> > > > >>>>>>>>> > > > >>>>>>>> > > > >>>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> -- > > > >>>>>> Thanks and regards, > > > >>>>>> Ravindra. > > > >>>> > > > >>> > >