example sparse union: types: (int64, utf8) type_ids: [0, 4]
type buffer: [0, 0, 0, 4, 4, 4] child 0: [1, 2, 3, --, --, --] child 1: [--, --, --, 'foo', 'bar', 'baz'] example dense union: types: (int64, utf8) type_ids: [0, 4] type buffer: [0, 0, 0, 4, 4, 4] offsets buffer: [0, 1, 2, 0, 1, 2] child 0: [1, 2, 3] child 1: ['foo', 'bar', 'baz'] On Mon, Jun 17, 2019 at 3:50 PM Antoine Pitrou <anto...@python.org> wrote: > > > Le 17/06/2019 à 22:46, Wes McKinney a écrit : > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L88 > > > > "optionally typeIds provides an indirection between the child offset > > and the type id for each child typeIds[offset] is the id used in the > > type vector" > > Does this mean typeIds[logical_id] gives the physical child index? > or typeIds[physical_index] gives the logical type id? > > I don't understand what "offset" means in this context, usually it's a > displacement from the beginning of an array. > > > > > > On Mon, Jun 17, 2019 at 12:26 PM Ben Kietzman <ben.kietz...@rstudio.com> > > wrote: > >> > >> Somewhat related: > >> > >> Could we clarify the expected content of the type_ids buffer of union > >> arrays? Layout.rst > >> <https://github.com/apache/arrow/blob/dede1e6/docs/source/format/Layout.rst#dense-union-type> > >> seems to indicate these should be indices of the corresponding child array, > >> but the C++ implementation allows them to be any positive int8 and > >> maintains a mapping to child indices. (for example see what is generated > >> for integration testing > >> <https://github.com/apache/arrow/blob/dede1e6/cpp/src/arrow/ipc/test-common.cc#L410>: > >> has two child arrays and type_ids 5, 10) > >> > >> On Mon, Jun 17, 2019 at 11:35 AM Micah Kornfield <emkornfi...@gmail.com> > >> wrote: > >> > >>> Sounds good. Sorry I got distracted with some other stuff but should be > >>> getting back to this soonish > >>> > >>> On Monday, June 17, 2019, Wes McKinney <wesmck...@gmail.com> wrote: > >>> > >>>> I'd already moved the Union issues to 1.0.0 so we are all good there > >>>> > >>>> On Mon, Jun 17, 2019 at 10:18 AM Wes McKinney <wesmck...@gmail.com> > >>> wrote: > >>>>> > >>>>> I'm also +1 for generalized unions as we currently have specified. The > >>>>> objections from the Java users seems to be mostly on the basis of > >>>>> performance in the union-of-primitives case -- that's an > >>>>> implementation specific issue, so if Java needs to have a > >>>>> "GeneralizedDenseUnionVector" or something to handle the > >>>>> union-of-anything case, then that seems reasonable. The important > >>>>> thing is that the binary protocol itself and serialized metadata is > >>>>> something that we are happy with and won't need to change going > >>>>> forward. > >>>>> > >>>>> It seems we're getting a bit long in the tooth to get this into 0.14.0 > >>>>> so I'm going to move the Union-related issues to the 1.0.0 milestone > >>>>> so we can get this resolved for the 1.0.0 release > >>>>> > >>>>> Thanks > >>>>> Wes > >>>>> > >>>>> On Mon, Jun 10, 2019 at 12:33 AM Ravindra Pindikura < > >>> ravin...@dremio.com> > >>>> wrote: > >>>>>> > >>>>>> On Sat, May 25, 2019 at 12:29 PM Micah Kornfield < > >>>> emkornfi...@gmail.com> > >>>>>> wrote: > >>>>>> > >>>>>>> Thanks for the responses, I've clipped the questions and provided > >>>> responses > >>>>>>> inline. > >>>>>>> > >>>>>>> is the proposal that both cpp & java will support only option 2 ? > >>>>>>>> I guess 1 is a subset of 2 anyway. > >>>>>>> > >>>>>>> CPP already supports option 2. I would like to make CPP and java > >>>>>>> compatible, in a way that this acceptable for Java maintainers. > >>>> Yes, 1 is > >>>>>>> a subset of 2. > >>>>>>> > >>>>>>> The metadata on java side uses the minor type id as the type > >>>> identifier in > >>>>>>>> the union (and the field name is expected to the the same as the > >>>> type > >>>>>>> name > >>>>>>>> in a union). If you were to support a generalized union, this > >>>> wouldn't > >>>>>>>> work. How will the type identifiers be generated ? > >>>>>>>> I'm trying to see if we can make the change backward compatible, > >>>> with > >>>>>>>> existing unions in java. > >>>>>>> > >>>>>>> > >>>>>>> Looking at the code, I don't think the existing Union class is > >>>>>>> generalizable because of this assumption (it caches a single type > >>> of > >>>> each > >>>>>>> locally) and based on the Javadoc this seems to be for performance > >>>> reasons, > >>>>>>> so I would like to try to avoid touching it if possible. > >>>>>>> > >>>>>> > >>>>>> @Micah Kornfield <emkornfi...@gmail.com> sorry, I wasn't clear. I > >>>> meant to > >>>>>> ask if the format would be backward compatible, which I think it will > >>>> be > >>>>>> (since 1 is a subset of 2, and your proposal isn't making a change in > >>>> the > >>>>>> wire format). > >>>>>> > >>>>>> I'm fine if the APIs are not backward compatible. Or, once we have 2, > >>>> we > >>>>>> can add wrappers for 1, if required. > >>>>>> > >>>>>> > >>>>>>> > >>>>>>> My original thinking was to try to factor out a common base class > >>>> from > >>>>>>> UnionVector, then create a new GeneralizedUnionVector class that > >>> has > >>>>>>> slightly different method signatures (its possible I will need > >>>> additional > >>>>>>> supporting classes like a new GeneralizedUnionWriter, but i haven't > >>>> gotten > >>>>>>> that far yet). The main challenge I see is a way to let users > >>> switch > >>>>>>> between the two implementations. Some thoughts off the top of my > >>>> head > >>>>>>> (none of them seem good): > >>>>>>> 1. Create a flag like: > >>>>>>> > >>>>>>> > >>> https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2e > >>>> f465d331fc/java/memory/src/main/java/org/apache/arrow/ > >>>> memory/BoundsChecking.java > >>>>>>> so > >>>>>>> it is statically decided before hand, and have the new class > >>>> implement the > >>>>>>> same signatures as UnionVector to but throw an exception if a > >>> method > >>>> that > >>>>>>> isn't compatible is called. > >>>>>>> 2. Possibly try to augment ArrowType to pass through information > >>>> about its > >>>>>>> children vectors when reading vectors, but use the flag in option 1 > >>>> if it > >>>>>>> can't be determined. > >>>>>>> > >>>>>>> I'm open to suggestions. I'm also happy to try to prototype > >>>> something and > >>>>>>> get feedback once there is concrete code to evaluate. > >>>>>>> > >>>>>>> I don't understand the limitation to different types, so +1 for > >>>>>>>> generalized unions. That said, I don't think it's high-priority > >>>> either. > >>>>>>> > >>>>>>> > >>>>>>> Antoine, the fact that it isn't high-priority probably is why it > >>>> has taken > >>>>>>> so long to resolve. I'm excited to get to more interesting higher > >>>> priority > >>>>>>> work, but I would like to see some of the basics finished off > >>>> first. BTW, > >>>>>>> if you have suggestions on priorities, I'd be happy to hear them. > >>>> After > >>>>>>> this, handling empty record batches, and the UBSan work I'm doing, > >>> I > >>>> was > >>>>>>> thinking of either trying to get Avro support in, or work on fit > >>> and > >>>> finish > >>>>>>> items for the C++/Python CSV reader. > >>>>>>> > >>>>>>> Thanks, > >>>>>>> Micah > >>>>>>> > >>>>>>> [1] > >>> https://github.com/apache/arrow/pull/987#issuecomment-493231493 > >>>>>>> [2] > >>>>>>> > >>>>>>> > >>> https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a034967 > >>>> 5e488caf11/java/vector/src/main/java/org/apache/arrow/ > >>>> vector/types/pojo/Field.java#L104 > >>>>>>> > >>>>>>> On Fri, May 24, 2019 at 2:08 AM Antoine Pitrou <anto...@python.org > >>>> > >>>> wrote: > >>>>>>> > >>>>>>>> > >>>>>>>> I don't understand the limitation to different types, so +1 for > >>>>>>>> generalized unions. That said, I don't think it's high-priority > >>>> either. > >>>>>>>> > >>>>>>>> Regards > >>>>>>>> > >>>>>>>> Antoine. > >>>>>>>> > >>>>>>>> > >>>>>>>> Le 24/05/2019 à 04:17, Micah Kornfield a écrit : > >>>>>>>>> I'd like to bump this thread, to see if anyone has any > >>>> comments. If > >>>>>>>> nobody > >>>>>>>>> objects I will try to start implementing the changes next week. > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> Micah > >>>>>>>>> > >>>>>>>>> On Mon, May 20, 2019 at 9:37 PM Micah Kornfield < > >>>> emkornfi...@gmail.com > >>>>>>>> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> In the past [1] there hasn't been agreement on the final > >>>> requirements > >>>>>>>> for > >>>>>>>>>> union types. > >>>>>>>>>> > >>>>>>>>>> Briefly the two approaches that are currently advocated: > >>>>>>>>>> 1. Limit unions to only contain one field of each individual > >>>> type > >>>>>>> (e.g. > >>>>>>>>>> you can't have two separate int32 fields). Java takes this > >>>> approach. > >>>>>>>>>> 2. Generalized unions (unions can have any number of fields > >>>> with the > >>>>>>>> same > >>>>>>>>>> type). C++ takes this approach. > >>>>>>>>>> > >>>>>>>>>> There was a prior PR [2] that stalled in trying to take this > >>>> approach > >>>>>>>> with > >>>>>>>>>> Java. For writing vectors it seemed to be slower on a > >>>> benchmark. > >>>>>>>>>> > >>>>>>>>>> My proposal: We should pursue option 2 (the general > >>>> approach). There > >>>>>>>> are > >>>>>>>>>> already data interchange formats that support it and it would > >>>> be nice > >>>>>>>> to a > >>>>>>>>>> data-model that lets us make the translation between Arrow > >>>> schemas > >>>>>>> easy: > >>>>>>>>>> 1. Avro Seems to support it [3] (with the exception of > >>> complex > >>>> types) > >>>>>>>>>> 2. Protobufs loosely support it [4] via one-of. > >>>>>>>>>> > >>>>>>>>>> In order to address issues in [2], I propose the following > >>>> making the > >>>>>>>>>> changes/additions to the Java implementation: > >>>>>>>>>> 1. Keep the default write-path untouched with the existing > >>>> class. > >>>>>>>>>> 2. Add in a new sparse union class that implements the same > >>>> interface > >>>>>>>>>> that can be used on the read path, and if a client opts in > >>> (via > >>>> direct > >>>>>>>>>> construction). > >>>>>>>>>> 3. Add in a dense union class (I don't believe Java has one). > >>>>>>>>>> > >>>>>>>>>> I'm still ramping up the Java code base, so I'd like other > >>> Java > >>>>>>>>>> contributors to chime in to see if this plan sounds feasible > >>> and > >>>>>>>> acceptable. > >>>>>>>>>> > >>>>>>>>>> Any other thoughts on Unions? > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> Micah > >>>>>>>>>> > >>>>>>>>>> [1] > >>>>>>>>>> > >>>>>>>> > >>>>>>> > >>> https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee > >>>> 9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E > >>>>>>>>>> [2] https://github.com/apache/arrow/pull/987#issuecomment- > >>>> 493231493 > >>>>>>>>>> [3] https://github.com/apache/arrow/pull/987#issuecomment- > >>>> 493231493 > >>>>>>>>>> [4] > >>> https://developers.google.com/protocol-buffers/docs/proto# > >>>> oneof > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> -- > >>>>>> Thanks and regards, > >>>>>> Ravindra. > >>>> > >>>