Thanks for the responses, I've clipped the questions and provided responses
inline.

is the proposal that both cpp & java will support only option 2 ?
> I guess 1 is a subset of 2 anyway.

CPP already supports option 2.  I would like to make CPP and java
compatible, in a way that this acceptable for Java maintainers.  Yes, 1 is
a subset of 2.

The metadata on java side uses the minor type id as the type identifier in
> the union (and the field name is expected to the the same as the type name
> in a union). If you were to support a generalized union, this wouldn't
> work. How will the type identifiers be generated ?
> I'm trying to see if we can make the change backward compatible, with
> existing unions in java.


Looking at the code, I don't think the existing Union class is
generalizable because of this assumption (it caches a single type of each
locally) and based on the Javadoc this seems to be for performance reasons,
so I would like to try to avoid touching it if possible.

My original thinking was to try to factor out a common base class from
UnionVector, then create a new GeneralizedUnionVector class that has
slightly different method signatures (its possible I will need additional
supporting classes like a new GeneralizedUnionWriter, but i haven't gotten
that far yet).  The main challenge I see is a way to let users switch
between the two implementations.  Some thoughts off the top of my head
(none of them seem good):
1.  Create a flag like:
https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2ef465d331fc/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
so
it is statically decided before hand, and have the new class implement the
same signatures as UnionVector to but throw an exception if a method that
isn't compatible is called.
2. Possibly try to augment ArrowType to pass through information about its
children vectors when reading vectors, but use the flag in option 1 if it
can't be determined.

I'm open to suggestions.  I'm also happy to try to prototype something and
get feedback once there is concrete code to evaluate.

I don't understand the limitation to different types, so +1 for
> generalized unions.  That said, I don't think it's high-priority either.


Antoine, the fact that it isn't high-priority probably  is why it has taken
so long to resolve.  I'm excited to get to more interesting higher priority
work, but I would like to see some of the basics finished off first.  BTW,
if you have suggestions on priorities, I'd be happy to hear them.  After
this, handling empty record batches, and the UBSan work I'm doing, I was
thinking of either trying to get Avro support in, or work on fit and finish
items for the C++/Python CSV reader.

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/987#issuecomment-493231493
[2]
https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a0349675e488caf11/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java#L104

On Fri, May 24, 2019 at 2:08 AM Antoine Pitrou <anto...@python.org> wrote:

>
> I don't understand the limitation to different types, so +1 for
> generalized unions.  That said, I don't think it's high-priority either.
>
> Regards
>
> Antoine.
>
>
> Le 24/05/2019 à 04:17, Micah Kornfield a écrit :
> > I'd like to bump this thread, to see if anyone has any comments.  If
> nobody
> > objects I will try to start implementing the changes next week.
> >
> > Thanks,
> > Micah
> >
> > On Mon, May 20, 2019 at 9:37 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> In the past [1] there hasn't been agreement on the final requirements
> for
> >> union types.
> >>
> >> Briefly the two approaches that are currently advocated:
> >> 1.  Limit unions to only contain one field of each individual type (e.g.
> >> you can't have two separate int32 fields).  Java takes this approach.
> >> 2.  Generalized unions (unions can have any number of fields with the
> same
> >> type).  C++ takes this approach.
> >>
> >> There was a prior PR [2] that stalled in trying to take this approach
> with
> >> Java.  For writing vectors it seemed to be slower on a benchmark.
> >>
> >> My proposal:  We should pursue option 2 (the general approach).  There
> are
> >> already data interchange formats that support it and it would be nice
> to a
> >> data-model that lets us make the translation between Arrow schemas easy:
> >> 1.  Avro Seems to support it [3] (with the exception of complex types)
> >> 2.  Protobufs loosely support it [4] via one-of.
> >>
> >> In order to address issues in [2], I propose the following making the
> >> changes/additions to the Java implementation:
> >> 1.  Keep the default write-path untouched with the existing class.
> >> 2.  Add in a new sparse union class that implements the same interface
> >> that can be used on the read path, and if a client opts in (via direct
> >> construction).
> >> 3.  Add in a dense union class (I don't believe Java has one).
> >>
> >> I'm still ramping up the Java code base, so I'd like other Java
> >> contributors to chime in to see if this plan sounds feasible and
> acceptable.
> >>
> >> Any other thoughts on Unions?
> >>
> >> Thanks,
> >> Micah
> >>
> >> [1]
> >>
> https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E
> >> [2] https://github.com/apache/arrow/pull/987#issuecomment-493231493
> >> [3] https://github.com/apache/arrow/pull/987#issuecomment-493231493
> >> [4] https://developers.google.com/protocol-buffers/docs/proto#oneof
> >>
> >
>

Reply via email to