I'd already moved the Union issues to 1.0.0 so we are all good there

On Mon, Jun 17, 2019 at 10:18 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> I'm also +1 for generalized unions as we currently have specified. The
> objections from the Java users seems to be mostly on the basis of
> performance in the union-of-primitives case -- that's an
> implementation specific issue, so if Java needs to have a
> "GeneralizedDenseUnionVector" or something to handle the
> union-of-anything case, then that seems reasonable. The important
> thing is that the binary protocol itself and serialized metadata is
> something that we are happy with and won't need to change going
> forward.
>
> It seems we're getting a bit long in the tooth to get this into 0.14.0
> so I'm going to move the Union-related issues to the 1.0.0 milestone
> so we can get this resolved for the 1.0.0 release
>
> Thanks
> Wes
>
> On Mon, Jun 10, 2019 at 12:33 AM Ravindra Pindikura <ravin...@dremio.com> 
> wrote:
> >
> > On Sat, May 25, 2019 at 12:29 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > Thanks for the responses, I've clipped the questions and provided 
> > > responses
> > > inline.
> > >
> > > is the proposal that both cpp & java will support only option 2 ?
> > > > I guess 1 is a subset of 2 anyway.
> > >
> > > CPP already supports option 2.  I would like to make CPP and java
> > > compatible, in a way that this acceptable for Java maintainers.  Yes, 1 is
> > > a subset of 2.
> > >
> > > The metadata on java side uses the minor type id as the type identifier in
> > > > the union (and the field name is expected to the the same as the type
> > > name
> > > > in a union). If you were to support a generalized union, this wouldn't
> > > > work. How will the type identifiers be generated ?
> > > > I'm trying to see if we can make the change backward compatible, with
> > > > existing unions in java.
> > >
> > >
> > > Looking at the code, I don't think the existing Union class is
> > > generalizable because of this assumption (it caches a single type of each
> > > locally) and based on the Javadoc this seems to be for performance 
> > > reasons,
> > > so I would like to try to avoid touching it if possible.
> > >
> >
> > @Micah Kornfield <emkornfi...@gmail.com> sorry, I wasn't clear. I meant to
> > ask if the format would be backward compatible, which I think it will be
> > (since 1 is a subset of 2, and your proposal isn't making a change in the
> > wire format).
> >
> > I'm fine if the APIs are not backward compatible. Or, once we have 2, we
> > can add wrappers for 1, if required.
> >
> >
> > >
> > > My original thinking was to try to factor out a common base class from
> > > UnionVector, then create a new GeneralizedUnionVector class that has
> > > slightly different method signatures (its possible I will need additional
> > > supporting classes like a new GeneralizedUnionWriter, but i haven't gotten
> > > that far yet).  The main challenge I see is a way to let users switch
> > > between the two implementations.  Some thoughts off the top of my head
> > > (none of them seem good):
> > > 1.  Create a flag like:
> > >
> > > https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2ef465d331fc/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
> > > so
> > > it is statically decided before hand, and have the new class implement the
> > > same signatures as UnionVector to but throw an exception if a method that
> > > isn't compatible is called.
> > > 2. Possibly try to augment ArrowType to pass through information about its
> > > children vectors when reading vectors, but use the flag in option 1 if it
> > > can't be determined.
> > >
> > > I'm open to suggestions.  I'm also happy to try to prototype something and
> > > get feedback once there is concrete code to evaluate.
> > >
> > > I don't understand the limitation to different types, so +1 for
> > > > generalized unions.  That said, I don't think it's high-priority either.
> > >
> > >
> > > Antoine, the fact that it isn't high-priority probably  is why it has 
> > > taken
> > > so long to resolve.  I'm excited to get to more interesting higher 
> > > priority
> > > work, but I would like to see some of the basics finished off first.  BTW,
> > > if you have suggestions on priorities, I'd be happy to hear them.  After
> > > this, handling empty record batches, and the UBSan work I'm doing, I was
> > > thinking of either trying to get Avro support in, or work on fit and 
> > > finish
> > > items for the C++/Python CSV reader.
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1] https://github.com/apache/arrow/pull/987#issuecomment-493231493
> > > [2]
> > >
> > > https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a0349675e488caf11/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java#L104
> > >
> > > On Fri, May 24, 2019 at 2:08 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > >
> > > > I don't understand the limitation to different types, so +1 for
> > > > generalized unions.  That said, I don't think it's high-priority either.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > Le 24/05/2019 à 04:17, Micah Kornfield a écrit :
> > > > > I'd like to bump this thread, to see if anyone has any comments.  If
> > > > nobody
> > > > > objects I will try to start implementing the changes next week.
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > > On Mon, May 20, 2019 at 9:37 PM Micah Kornfield <emkornfi...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > >> In the past [1] there hasn't been agreement on the final requirements
> > > > for
> > > > >> union types.
> > > > >>
> > > > >> Briefly the two approaches that are currently advocated:
> > > > >> 1.  Limit unions to only contain one field of each individual type
> > > (e.g.
> > > > >> you can't have two separate int32 fields).  Java takes this approach.
> > > > >> 2.  Generalized unions (unions can have any number of fields with the
> > > > same
> > > > >> type).  C++ takes this approach.
> > > > >>
> > > > >> There was a prior PR [2] that stalled in trying to take this approach
> > > > with
> > > > >> Java.  For writing vectors it seemed to be slower on a benchmark.
> > > > >>
> > > > >> My proposal:  We should pursue option 2 (the general approach).  
> > > > >> There
> > > > are
> > > > >> already data interchange formats that support it and it would be nice
> > > > to a
> > > > >> data-model that lets us make the translation between Arrow schemas
> > > easy:
> > > > >> 1.  Avro Seems to support it [3] (with the exception of complex 
> > > > >> types)
> > > > >> 2.  Protobufs loosely support it [4] via one-of.
> > > > >>
> > > > >> In order to address issues in [2], I propose the following making the
> > > > >> changes/additions to the Java implementation:
> > > > >> 1.  Keep the default write-path untouched with the existing class.
> > > > >> 2.  Add in a new sparse union class that implements the same 
> > > > >> interface
> > > > >> that can be used on the read path, and if a client opts in (via 
> > > > >> direct
> > > > >> construction).
> > > > >> 3.  Add in a dense union class (I don't believe Java has one).
> > > > >>
> > > > >> I'm still ramping up the Java code base, so I'd like other Java
> > > > >> contributors to chime in to see if this plan sounds feasible and
> > > > acceptable.
> > > > >>
> > > > >> Any other thoughts on Unions?
> > > > >>
> > > > >> Thanks,
> > > > >> Micah
> > > > >>
> > > > >> [1]
> > > > >>
> > > >
> > > https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E
> > > > >> [2] https://github.com/apache/arrow/pull/987#issuecomment-493231493
> > > > >> [3] https://github.com/apache/arrow/pull/987#issuecomment-493231493
> > > > >> [4] https://developers.google.com/protocol-buffers/docs/proto#oneof
> > > > >>
> > > > >
> > > >
> > >
> >
> >
> > --
> > Thanks and regards,
> > Ravindra.

Reply via email to