Le 17/06/2019 à 22:46, Wes McKinney a écrit :
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L88
> 
> "optionally typeIds provides an indirection between the child offset
> and the type id for each child typeIds[offset] is the id used in the
> type vector"

Does this mean typeIds[logical_id] gives the physical child index?
or typeIds[physical_index] gives the logical type id?

I don't understand what "offset" means in this context, usually it's a
displacement from the beginning of an array.


> 
> On Mon, Jun 17, 2019 at 12:26 PM Ben Kietzman <ben.kietz...@rstudio.com> 
> wrote:
>>
>> Somewhat related:
>>
>> Could we clarify the expected content of the type_ids buffer of union
>> arrays? Layout.rst
>> <https://github.com/apache/arrow/blob/dede1e6/docs/source/format/Layout.rst#dense-union-type>
>> seems to indicate these should be indices of the corresponding child array,
>> but the C++ implementation allows them to be any positive int8 and
>> maintains a mapping to child indices. (for example see what is generated
>> for integration testing
>> <https://github.com/apache/arrow/blob/dede1e6/cpp/src/arrow/ipc/test-common.cc#L410>:
>> has two child arrays and type_ids 5, 10)
>>
>> On Mon, Jun 17, 2019 at 11:35 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> Sounds good.  Sorry I got distracted with some other stuff but should be
>>> getting back to this soonish
>>>
>>> On Monday, June 17, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>>> I'd already moved the Union issues to 1.0.0 so we are all good there
>>>>
>>>> On Mon, Jun 17, 2019 at 10:18 AM Wes McKinney <wesmck...@gmail.com>
>>> wrote:
>>>>>
>>>>> I'm also +1 for generalized unions as we currently have specified. The
>>>>> objections from the Java users seems to be mostly on the basis of
>>>>> performance in the union-of-primitives case -- that's an
>>>>> implementation specific issue, so if Java needs to have a
>>>>> "GeneralizedDenseUnionVector" or something to handle the
>>>>> union-of-anything case, then that seems reasonable. The important
>>>>> thing is that the binary protocol itself and serialized metadata is
>>>>> something that we are happy with and won't need to change going
>>>>> forward.
>>>>>
>>>>> It seems we're getting a bit long in the tooth to get this into 0.14.0
>>>>> so I'm going to move the Union-related issues to the 1.0.0 milestone
>>>>> so we can get this resolved for the 1.0.0 release
>>>>>
>>>>> Thanks
>>>>> Wes
>>>>>
>>>>> On Mon, Jun 10, 2019 at 12:33 AM Ravindra Pindikura <
>>> ravin...@dremio.com>
>>>> wrote:
>>>>>>
>>>>>> On Sat, May 25, 2019 at 12:29 PM Micah Kornfield <
>>>> emkornfi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the responses, I've clipped the questions and provided
>>>> responses
>>>>>>> inline.
>>>>>>>
>>>>>>> is the proposal that both cpp & java will support only option 2 ?
>>>>>>>> I guess 1 is a subset of 2 anyway.
>>>>>>>
>>>>>>> CPP already supports option 2.  I would like to make CPP and java
>>>>>>> compatible, in a way that this acceptable for Java maintainers.
>>>> Yes, 1 is
>>>>>>> a subset of 2.
>>>>>>>
>>>>>>> The metadata on java side uses the minor type id as the type
>>>> identifier in
>>>>>>>> the union (and the field name is expected to the the same as the
>>>> type
>>>>>>> name
>>>>>>>> in a union). If you were to support a generalized union, this
>>>> wouldn't
>>>>>>>> work. How will the type identifiers be generated ?
>>>>>>>> I'm trying to see if we can make the change backward compatible,
>>>> with
>>>>>>>> existing unions in java.
>>>>>>>
>>>>>>>
>>>>>>> Looking at the code, I don't think the existing Union class is
>>>>>>> generalizable because of this assumption (it caches a single type
>>> of
>>>> each
>>>>>>> locally) and based on the Javadoc this seems to be for performance
>>>> reasons,
>>>>>>> so I would like to try to avoid touching it if possible.
>>>>>>>
>>>>>>
>>>>>> @Micah Kornfield <emkornfi...@gmail.com> sorry, I wasn't clear. I
>>>> meant to
>>>>>> ask if the format would be backward compatible, which I think it will
>>>> be
>>>>>> (since 1 is a subset of 2, and your proposal isn't making a change in
>>>> the
>>>>>> wire format).
>>>>>>
>>>>>> I'm fine if the APIs are not backward compatible. Or, once we have 2,
>>>> we
>>>>>> can add wrappers for 1, if required.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> My original thinking was to try to factor out a common base class
>>>> from
>>>>>>> UnionVector, then create a new GeneralizedUnionVector class that
>>> has
>>>>>>> slightly different method signatures (its possible I will need
>>>> additional
>>>>>>> supporting classes like a new GeneralizedUnionWriter, but i haven't
>>>> gotten
>>>>>>> that far yet).  The main challenge I see is a way to let users
>>> switch
>>>>>>> between the two implementations.  Some thoughts off the top of my
>>>> head
>>>>>>> (none of them seem good):
>>>>>>> 1.  Create a flag like:
>>>>>>>
>>>>>>>
>>> https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2e
>>>> f465d331fc/java/memory/src/main/java/org/apache/arrow/
>>>> memory/BoundsChecking.java
>>>>>>> so
>>>>>>> it is statically decided before hand, and have the new class
>>>> implement the
>>>>>>> same signatures as UnionVector to but throw an exception if a
>>> method
>>>> that
>>>>>>> isn't compatible is called.
>>>>>>> 2. Possibly try to augment ArrowType to pass through information
>>>> about its
>>>>>>> children vectors when reading vectors, but use the flag in option 1
>>>> if it
>>>>>>> can't be determined.
>>>>>>>
>>>>>>> I'm open to suggestions.  I'm also happy to try to prototype
>>>> something and
>>>>>>> get feedback once there is concrete code to evaluate.
>>>>>>>
>>>>>>> I don't understand the limitation to different types, so +1 for
>>>>>>>> generalized unions.  That said, I don't think it's high-priority
>>>> either.
>>>>>>>
>>>>>>>
>>>>>>> Antoine, the fact that it isn't high-priority probably  is why it
>>>> has taken
>>>>>>> so long to resolve.  I'm excited to get to more interesting higher
>>>> priority
>>>>>>> work, but I would like to see some of the basics finished off
>>>> first.  BTW,
>>>>>>> if you have suggestions on priorities, I'd be happy to hear them.
>>>> After
>>>>>>> this, handling empty record batches, and the UBSan work I'm doing,
>>> I
>>>> was
>>>>>>> thinking of either trying to get Avro support in, or work on fit
>>> and
>>>> finish
>>>>>>> items for the C++/Python CSV reader.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Micah
>>>>>>>
>>>>>>> [1]
>>> https://github.com/apache/arrow/pull/987#issuecomment-493231493
>>>>>>> [2]
>>>>>>>
>>>>>>>
>>> https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a034967
>>>> 5e488caf11/java/vector/src/main/java/org/apache/arrow/
>>>> vector/types/pojo/Field.java#L104
>>>>>>>
>>>>>>> On Fri, May 24, 2019 at 2:08 AM Antoine Pitrou <anto...@python.org
>>>>
>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I don't understand the limitation to different types, so +1 for
>>>>>>>> generalized unions.  That said, I don't think it's high-priority
>>>> either.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Antoine.
>>>>>>>>
>>>>>>>>
>>>>>>>> Le 24/05/2019 à 04:17, Micah Kornfield a écrit :
>>>>>>>>> I'd like to bump this thread, to see if anyone has any
>>>> comments.  If
>>>>>>>> nobody
>>>>>>>>> objects I will try to start implementing the changes next week.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Micah
>>>>>>>>>
>>>>>>>>> On Mon, May 20, 2019 at 9:37 PM Micah Kornfield <
>>>> emkornfi...@gmail.com
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> In the past [1] there hasn't been agreement on the final
>>>> requirements
>>>>>>>> for
>>>>>>>>>> union types.
>>>>>>>>>>
>>>>>>>>>> Briefly the two approaches that are currently advocated:
>>>>>>>>>> 1.  Limit unions to only contain one field of each individual
>>>> type
>>>>>>> (e.g.
>>>>>>>>>> you can't have two separate int32 fields).  Java takes this
>>>> approach.
>>>>>>>>>> 2.  Generalized unions (unions can have any number of fields
>>>> with the
>>>>>>>> same
>>>>>>>>>> type).  C++ takes this approach.
>>>>>>>>>>
>>>>>>>>>> There was a prior PR [2] that stalled in trying to take this
>>>> approach
>>>>>>>> with
>>>>>>>>>> Java.  For writing vectors it seemed to be slower on a
>>>> benchmark.
>>>>>>>>>>
>>>>>>>>>> My proposal:  We should pursue option 2 (the general
>>>> approach).  There
>>>>>>>> are
>>>>>>>>>> already data interchange formats that support it and it would
>>>> be nice
>>>>>>>> to a
>>>>>>>>>> data-model that lets us make the translation between Arrow
>>>> schemas
>>>>>>> easy:
>>>>>>>>>> 1.  Avro Seems to support it [3] (with the exception of
>>> complex
>>>> types)
>>>>>>>>>> 2.  Protobufs loosely support it [4] via one-of.
>>>>>>>>>>
>>>>>>>>>> In order to address issues in [2], I propose the following
>>>> making the
>>>>>>>>>> changes/additions to the Java implementation:
>>>>>>>>>> 1.  Keep the default write-path untouched with the existing
>>>> class.
>>>>>>>>>> 2.  Add in a new sparse union class that implements the same
>>>> interface
>>>>>>>>>> that can be used on the read path, and if a client opts in
>>> (via
>>>> direct
>>>>>>>>>> construction).
>>>>>>>>>> 3.  Add in a dense union class (I don't believe Java has one).
>>>>>>>>>>
>>>>>>>>>> I'm still ramping up the Java code base, so I'd like other
>>> Java
>>>>>>>>>> contributors to chime in to see if this plan sounds feasible
>>> and
>>>>>>>> acceptable.
>>>>>>>>>>
>>>>>>>>>> Any other thoughts on Unions?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Micah
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>> https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee
>>>> 9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E
>>>>>>>>>> [2] https://github.com/apache/arrow/pull/987#issuecomment-
>>>> 493231493
>>>>>>>>>> [3] https://github.com/apache/arrow/pull/987#issuecomment-
>>>> 493231493
>>>>>>>>>> [4]
>>> https://developers.google.com/protocol-buffers/docs/proto#
>>>> oneof
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks and regards,
>>>>>> Ravindra.
>>>>
>>>

Reply via email to