Sounds good. Sorry I got distracted with some other stuff but should be getting back to this soonish
On Monday, June 17, 2019, Wes McKinney <wesmck...@gmail.com> wrote: > I'd already moved the Union issues to 1.0.0 so we are all good there > > On Mon, Jun 17, 2019 at 10:18 AM Wes McKinney <wesmck...@gmail.com> wrote: > > > > I'm also +1 for generalized unions as we currently have specified. The > > objections from the Java users seems to be mostly on the basis of > > performance in the union-of-primitives case -- that's an > > implementation specific issue, so if Java needs to have a > > "GeneralizedDenseUnionVector" or something to handle the > > union-of-anything case, then that seems reasonable. The important > > thing is that the binary protocol itself and serialized metadata is > > something that we are happy with and won't need to change going > > forward. > > > > It seems we're getting a bit long in the tooth to get this into 0.14.0 > > so I'm going to move the Union-related issues to the 1.0.0 milestone > > so we can get this resolved for the 1.0.0 release > > > > Thanks > > Wes > > > > On Mon, Jun 10, 2019 at 12:33 AM Ravindra Pindikura <ravin...@dremio.com> > wrote: > > > > > > On Sat, May 25, 2019 at 12:29 PM Micah Kornfield < > emkornfi...@gmail.com> > > > wrote: > > > > > > > Thanks for the responses, I've clipped the questions and provided > responses > > > > inline. > > > > > > > > is the proposal that both cpp & java will support only option 2 ? > > > > > I guess 1 is a subset of 2 anyway. > > > > > > > > CPP already supports option 2. I would like to make CPP and java > > > > compatible, in a way that this acceptable for Java maintainers. > Yes, 1 is > > > > a subset of 2. > > > > > > > > The metadata on java side uses the minor type id as the type > identifier in > > > > > the union (and the field name is expected to the the same as the > type > > > > name > > > > > in a union). If you were to support a generalized union, this > wouldn't > > > > > work. How will the type identifiers be generated ? > > > > > I'm trying to see if we can make the change backward compatible, > with > > > > > existing unions in java. > > > > > > > > > > > > Looking at the code, I don't think the existing Union class is > > > > generalizable because of this assumption (it caches a single type of > each > > > > locally) and based on the Javadoc this seems to be for performance > reasons, > > > > so I would like to try to avoid touching it if possible. > > > > > > > > > > @Micah Kornfield <emkornfi...@gmail.com> sorry, I wasn't clear. I > meant to > > > ask if the format would be backward compatible, which I think it will > be > > > (since 1 is a subset of 2, and your proposal isn't making a change in > the > > > wire format). > > > > > > I'm fine if the APIs are not backward compatible. Or, once we have 2, > we > > > can add wrappers for 1, if required. > > > > > > > > > > > > > > My original thinking was to try to factor out a common base class > from > > > > UnionVector, then create a new GeneralizedUnionVector class that has > > > > slightly different method signatures (its possible I will need > additional > > > > supporting classes like a new GeneralizedUnionWriter, but i haven't > gotten > > > > that far yet). The main challenge I see is a way to let users switch > > > > between the two implementations. Some thoughts off the top of my > head > > > > (none of them seem good): > > > > 1. Create a flag like: > > > > > > > > https://github.com/apache/arrow/blob/ccdaa9f2a4c1af1222df840b608e2e > f465d331fc/java/memory/src/main/java/org/apache/arrow/ > memory/BoundsChecking.java > > > > so > > > > it is statically decided before hand, and have the new class > implement the > > > > same signatures as UnionVector to but throw an exception if a method > that > > > > isn't compatible is called. > > > > 2. Possibly try to augment ArrowType to pass through information > about its > > > > children vectors when reading vectors, but use the flag in option 1 > if it > > > > can't be determined. > > > > > > > > I'm open to suggestions. I'm also happy to try to prototype > something and > > > > get feedback once there is concrete code to evaluate. > > > > > > > > I don't understand the limitation to different types, so +1 for > > > > > generalized unions. That said, I don't think it's high-priority > either. > > > > > > > > > > > > Antoine, the fact that it isn't high-priority probably is why it > has taken > > > > so long to resolve. I'm excited to get to more interesting higher > priority > > > > work, but I would like to see some of the basics finished off > first. BTW, > > > > if you have suggestions on priorities, I'd be happy to hear them. > After > > > > this, handling empty record batches, and the UBSan work I'm doing, I > was > > > > thinking of either trying to get Avro support in, or work on fit and > finish > > > > items for the C++/Python CSV reader. > > > > > > > > Thanks, > > > > Micah > > > > > > > > [1] https://github.com/apache/arrow/pull/987#issuecomment-493231493 > > > > [2] > > > > > > > > https://github.com/apache/arrow/blob/7b2d68570b4336308c52081a034967 > 5e488caf11/java/vector/src/main/java/org/apache/arrow/ > vector/types/pojo/Field.java#L104 > > > > > > > > On Fri, May 24, 2019 at 2:08 AM Antoine Pitrou <anto...@python.org> > wrote: > > > > > > > > > > > > > > I don't understand the limitation to different types, so +1 for > > > > > generalized unions. That said, I don't think it's high-priority > either. > > > > > > > > > > Regards > > > > > > > > > > Antoine. > > > > > > > > > > > > > > > Le 24/05/2019 à 04:17, Micah Kornfield a écrit : > > > > > > I'd like to bump this thread, to see if anyone has any > comments. If > > > > > nobody > > > > > > objects I will try to start implementing the changes next week. > > > > > > > > > > > > Thanks, > > > > > > Micah > > > > > > > > > > > > On Mon, May 20, 2019 at 9:37 PM Micah Kornfield < > emkornfi...@gmail.com > > > > > > > > > > > wrote: > > > > > > > > > > > >> In the past [1] there hasn't been agreement on the final > requirements > > > > > for > > > > > >> union types. > > > > > >> > > > > > >> Briefly the two approaches that are currently advocated: > > > > > >> 1. Limit unions to only contain one field of each individual > type > > > > (e.g. > > > > > >> you can't have two separate int32 fields). Java takes this > approach. > > > > > >> 2. Generalized unions (unions can have any number of fields > with the > > > > > same > > > > > >> type). C++ takes this approach. > > > > > >> > > > > > >> There was a prior PR [2] that stalled in trying to take this > approach > > > > > with > > > > > >> Java. For writing vectors it seemed to be slower on a > benchmark. > > > > > >> > > > > > >> My proposal: We should pursue option 2 (the general > approach). There > > > > > are > > > > > >> already data interchange formats that support it and it would > be nice > > > > > to a > > > > > >> data-model that lets us make the translation between Arrow > schemas > > > > easy: > > > > > >> 1. Avro Seems to support it [3] (with the exception of complex > types) > > > > > >> 2. Protobufs loosely support it [4] via one-of. > > > > > >> > > > > > >> In order to address issues in [2], I propose the following > making the > > > > > >> changes/additions to the Java implementation: > > > > > >> 1. Keep the default write-path untouched with the existing > class. > > > > > >> 2. Add in a new sparse union class that implements the same > interface > > > > > >> that can be used on the read path, and if a client opts in (via > direct > > > > > >> construction). > > > > > >> 3. Add in a dense union class (I don't believe Java has one). > > > > > >> > > > > > >> I'm still ramping up the Java code base, so I'd like other Java > > > > > >> contributors to chime in to see if this plan sounds feasible and > > > > > acceptable. > > > > > >> > > > > > >> Any other thoughts on Unions? > > > > > >> > > > > > >> Thanks, > > > > > >> Micah > > > > > >> > > > > > >> [1] > > > > > >> > > > > > > > > > https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee > 9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E > > > > > >> [2] https://github.com/apache/arrow/pull/987#issuecomment- > 493231493 > > > > > >> [3] https://github.com/apache/arrow/pull/987#issuecomment- > 493231493 > > > > > >> [4] https://developers.google.com/protocol-buffers/docs/proto# > oneof > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Thanks and regards, > > > Ravindra. >