Le 17/06/2019 à 22:46, Wes McKinney a écrit : > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L88 > > "optionally typeIds provides an indirection between the child offset > and the type id for each child typeIds[offset] is the id used in the > type vector"
Does this mean typeIds[logical_id] gives the physical child index? or typeIds[physical_index] gives the logical type id? I don't understand what "offset" means in this context, usually it's a displacement from the beginning of an array. > > On Mon, Jun 17, 2019 at 12:26 PM Ben Kietzman <ben.kietz...@rstudio.com> > wrote: >> >> Somewhat related: >> >> Could we clarify the expected content of the type_ids buffer of union >> arrays? Layout.rst >> <https://github.com/apache/arrow/blob/dede1e6/docs/source/format/Layout.rst#dense-union-type> >> seems to indicate these should be indices of the corresponding child array, >> but the C++ implementation allows them to be any positive int8 and >> maintains a mapping to child indices. (for example see what is generated >> for integration testing >> <https://github.com/apache/arrow/blob/dede1e6/cpp/src/arrow/ipc/test-common.cc#L410>: >> has two child arrays and type_ids 5, 10) >> >> On Mon, Jun 17, 2019 at 11:35 AM Micah Kornfield <emkornfi...@gmail.com> >> wrote: >> >>> Sounds good. Sorry I got distracted with some other stuff but should be >>> getting back to this soonish >>> >>> On Monday, June 17, 2019, Wes McKinney <wesmck...@gmail.com> wrote: >>> >>>> I'd already moved the Union issues to 1.0.0 so we are all good there >>>> >>>> On Mon, Jun 17, 2019 at 10:18 AM Wes McKinney <wesmck...@gmail.com> >>> wrote: >>>>> >>>>> I'm also +1 for generalized unions as we currently have specified. The >>>>> objections from the Java users seems to be mostly on the basis of >>>>> performance in the union-of-primitives case -- that's an >>>>> implementation specific issue, so if Java needs to have a >>>>> "GeneralizedDenseUnionVector" or something to handle the >>>>> union-of-anything case, then that seems reasonable. The important >>>>> thing is that the binary protocol itself and serialized metadata is >>>>> something that we are happy with and won't need to change going >>>>> forward. >>>>> >>>>> It seems we're getting a bit long in the tooth to get this into 0.14.0 >>>>> so I'm going to move the Union-related issues to the 1.0.0 milestone >>>>> so we can get this resolved for the 1.0.0 release >>>>> >>>>> Thanks >>>>> Wes >>>>> >>>>> On Mon, Jun 10, 2019 at 12:33 AM Ravindra Pindikura < >>> ravin...@dremio.com> >>>> wrote: >>>>>> >>>>>> On Sat, May 25, 2019 at 12:29 PM Micah Kornfield < >>>> emkornfi...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Thanks for the responses, I've clipped the questions and provided >>>> responses >>>>>>> inline. >>>>>>> >>>>>>> is the proposal that both cpp & java will support only option 2 ? >>>>>>>> I guess 1 is a subset of 2 anyway. >>>>>>> >>>>>>> CPP already supports option 2. I would like to make CPP and java >>>>>>> compatible, in a way that this acceptable for Java maintainers. >>>> Yes, 1 is >>>>>>> a subset of 2. >>>>>>> >>>>>>> The metadata on java side uses the minor type id as the type >>>> identifier in >>>>>>>> the union (and the field name is expected to the the same as the >>>> type >>>>>>> name >>>>>>>> in a union). If you were to support a generalized union, this >>>> wouldn't >>>>>>>> work. How will the type identifiers be generated ? >>>>>>>> I'm trying to see if we can make the change backward compatible, >>>> with >>>>>>>> existing unions in java. >>>>>>> >>>>>>> >>>>>>> Looking at the code, I don't think the existing Union class is >>>>>>> generalizable because of this assumption (it caches a single type >>> of >>>> each >>>>>>> locally) and based on the Javadoc this seems to be for performance >>>> reasons, >>>>>>> so I would like to try to avoid touching it if possible. >>>>>>> >>>>>> >>>>>> @Micah Kornfield <emkornfi...@gmail.com> sorry, I wasn't clear. I >>>> meant to >>>>>> ask if the format would be backward compatible, which I think it will >>>> be >>>>>> (since 1 is a subset of 2, and your proposal isn't making a change in >>>> the >>>>>> wire format). >>>>>> >>>>>> I'm fine if the APIs are not backward compatible. Or, once we have 2, >>>> we >>>>>> can add wrappers for 1, if required. >>>>>> >>>>>> >>>>>>> >>>>>>> My original thinking was to try to factor out a common base class >>>> from >>>>>>> UnionVector, then create a new GeneralizedUnionVector class that >>> has >>>>>>> slightly different method signatures (its possible I will need >>>> additional >>>>>>> supporting classes like a new GeneralizedUnionWriter, but i haven't >>>> gotten >>>>>>> that far yet). The main challenge I see is a way to let users >>> switch >>>>>>> between the two implementations. Some thoughts off the top of my >>>> head >>>>>>> (none of them seem good): >>>>>>> 1. Create a flag like: >>>>>>> >>>>>>> >>> https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2e >>>> f465d331fc/java/memory/src/main/java/org/apache/arrow/ >>>> memory/BoundsChecking.java >>>>>>> so >>>>>>> it is statically decided before hand, and have the new class >>>> implement the >>>>>>> same signatures as UnionVector to but throw an exception if a >>> method >>>> that >>>>>>> isn't compatible is called. >>>>>>> 2. Possibly try to augment ArrowType to pass through information >>>> about its >>>>>>> children vectors when reading vectors, but use the flag in option 1 >>>> if it >>>>>>> can't be determined. >>>>>>> >>>>>>> I'm open to suggestions. I'm also happy to try to prototype >>>> something and >>>>>>> get feedback once there is concrete code to evaluate. >>>>>>> >>>>>>> I don't understand the limitation to different types, so +1 for >>>>>>>> generalized unions. That said, I don't think it's high-priority >>>> either. >>>>>>> >>>>>>> >>>>>>> Antoine, the fact that it isn't high-priority probably is why it >>>> has taken >>>>>>> so long to resolve. I'm excited to get to more interesting higher >>>> priority >>>>>>> work, but I would like to see some of the basics finished off >>>> first. BTW, >>>>>>> if you have suggestions on priorities, I'd be happy to hear them. >>>> After >>>>>>> this, handling empty record batches, and the UBSan work I'm doing, >>> I >>>> was >>>>>>> thinking of either trying to get Avro support in, or work on fit >>> and >>>> finish >>>>>>> items for the C++/Python CSV reader. >>>>>>> >>>>>>> Thanks, >>>>>>> Micah >>>>>>> >>>>>>> [1] >>> https://github.com/apache/arrow/pull/987#issuecomment-493231493 >>>>>>> [2] >>>>>>> >>>>>>> >>> https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a034967 >>>> 5e488caf11/java/vector/src/main/java/org/apache/arrow/ >>>> vector/types/pojo/Field.java#L104 >>>>>>> >>>>>>> On Fri, May 24, 2019 at 2:08 AM Antoine Pitrou <anto...@python.org >>>> >>>> wrote: >>>>>>> >>>>>>>> >>>>>>>> I don't understand the limitation to different types, so +1 for >>>>>>>> generalized unions. That said, I don't think it's high-priority >>>> either. >>>>>>>> >>>>>>>> Regards >>>>>>>> >>>>>>>> Antoine. >>>>>>>> >>>>>>>> >>>>>>>> Le 24/05/2019 à 04:17, Micah Kornfield a écrit : >>>>>>>>> I'd like to bump this thread, to see if anyone has any >>>> comments. If >>>>>>>> nobody >>>>>>>>> objects I will try to start implementing the changes next week. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Micah >>>>>>>>> >>>>>>>>> On Mon, May 20, 2019 at 9:37 PM Micah Kornfield < >>>> emkornfi...@gmail.com >>>>>>>> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> In the past [1] there hasn't been agreement on the final >>>> requirements >>>>>>>> for >>>>>>>>>> union types. >>>>>>>>>> >>>>>>>>>> Briefly the two approaches that are currently advocated: >>>>>>>>>> 1. Limit unions to only contain one field of each individual >>>> type >>>>>>> (e.g. >>>>>>>>>> you can't have two separate int32 fields). Java takes this >>>> approach. >>>>>>>>>> 2. Generalized unions (unions can have any number of fields >>>> with the >>>>>>>> same >>>>>>>>>> type). C++ takes this approach. >>>>>>>>>> >>>>>>>>>> There was a prior PR [2] that stalled in trying to take this >>>> approach >>>>>>>> with >>>>>>>>>> Java. For writing vectors it seemed to be slower on a >>>> benchmark. >>>>>>>>>> >>>>>>>>>> My proposal: We should pursue option 2 (the general >>>> approach). There >>>>>>>> are >>>>>>>>>> already data interchange formats that support it and it would >>>> be nice >>>>>>>> to a >>>>>>>>>> data-model that lets us make the translation between Arrow >>>> schemas >>>>>>> easy: >>>>>>>>>> 1. Avro Seems to support it [3] (with the exception of >>> complex >>>> types) >>>>>>>>>> 2. Protobufs loosely support it [4] via one-of. >>>>>>>>>> >>>>>>>>>> In order to address issues in [2], I propose the following >>>> making the >>>>>>>>>> changes/additions to the Java implementation: >>>>>>>>>> 1. Keep the default write-path untouched with the existing >>>> class. >>>>>>>>>> 2. Add in a new sparse union class that implements the same >>>> interface >>>>>>>>>> that can be used on the read path, and if a client opts in >>> (via >>>> direct >>>>>>>>>> construction). >>>>>>>>>> 3. Add in a dense union class (I don't believe Java has one). >>>>>>>>>> >>>>>>>>>> I'm still ramping up the Java code base, so I'd like other >>> Java >>>>>>>>>> contributors to chime in to see if this plan sounds feasible >>> and >>>>>>>> acceptable. >>>>>>>>>> >>>>>>>>>> Any other thoughts on Unions? >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Micah >>>>>>>>>> >>>>>>>>>> [1] >>>>>>>>>> >>>>>>>> >>>>>>> >>> https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee >>>> 9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E >>>>>>>>>> [2] https://github.com/apache/arrow/pull/987#issuecomment- >>>> 493231493 >>>>>>>>>> [3] https://github.com/apache/arrow/pull/987#issuecomment- >>>> 493231493 >>>>>>>>>> [4] >>> https://developers.google.com/protocol-buffers/docs/proto# >>>> oneof >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Thanks and regards, >>>>>> Ravindra. >>>> >>>