In the past [1] there hasn't been agreement on the final requirements for
union types.

Briefly, two approaches are currently advocated:
1.  Limit unions to only contain one field of each individual type (e.g.
you can't have two separate int32 fields).  Java takes this approach.
2.  Generalized unions (unions can have any number of fields with the same
type).  C++ takes this approach.
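
To make the difference concrete, here is a minimal sketch in Python.  It is
not the Arrow Java or C++ API; the function names and the
(field name, type name) representation of union children are made up for
illustration.  Under option 1 a child can be looked up by its type alone, so
a repeated type is an error.  Under option 2 each child is identified by its
position (its type id), so types may repeat.

```python
def index_by_type(children):
    """Option 1 (hypothetical helper): at most one child per type."""
    by_type = {}
    for field_name, type_name in children:
        if type_name in by_type:
            # A second child with the same type cannot be told apart by type.
            raise ValueError(f"duplicate child type: {type_name}")
        by_type[type_name] = field_name
    return by_type


def index_by_position(children):
    """Option 2 (hypothetical helper): children keyed by position / type id."""
    return dict(enumerate(children))
```

With children width:int32, height:int32 and label:utf8, index_by_position
assigns ids 0, 1 and 2, while index_by_type raises on the second int32.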

A prior PR [2] tried to take the generalized approach in Java but stalled.
Writing vectors seemed to be slower on a benchmark.

My proposal:  We should pursue option 2 (the general approach).  There are
already data interchange formats that support it, and it would be nice to
have a data model that makes translation to and from Arrow schemas easy:
1.  Avro seems to support it [3] (with the exception of complex types).
2.  Protobufs loosely support it [4] via oneof.

To address the issues raised in [2], I propose making the following
changes/additions to the Java implementation:
1.  Keep the default write-path untouched with the existing class.
2.  Add a new sparse union class that implements the same interface.  It
can be used on the read path, and elsewhere if a client opts in (via direct
construction).
3.  Add in a dense union class (I don't believe Java has one).
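
For anyone less familiar with the two layouts in items 2 and 3, here is a
rough sketch in Python.  It assumes my reading of the Arrow columnar format:
a sparse union stores one type id per slot and every child is as long as the
union, while a dense union also stores an offset per slot so each child holds
only its own values.  The function names are hypothetical and plain lists
stand in for buffers; this is not the proposed Java class design.

```python
def build_sparse(values, num_children):
    """values is a list of (type_id, value) pairs, one per union slot."""
    type_ids = [type_id for type_id, _ in values]
    # Every child has the full union length; None marks the unused slots.
    children = [[None] * len(values) for _ in range(num_children)]
    for slot, (type_id, value) in enumerate(values):
        children[type_id][slot] = value
    return type_ids, children


def build_dense(values, num_children):
    """Same input as build_sparse, but children are stored compactly."""
    type_ids, offsets = [], []
    children = [[] for _ in range(num_children)]
    for type_id, value in values:
        type_ids.append(type_id)
        # The offset is the value's index inside its own child.
        offsets.append(len(children[type_id]))
        children[type_id].append(value)
    return type_ids, offsets, children


def read_sparse(type_ids, children, slot):
    """Look up the child by type id, then index it by the union slot."""
    return children[type_ids[slot]][slot]


def read_dense(type_ids, offsets, children, slot):
    """Look up the child by type id, then index it by the stored offset."""
    return children[type_ids[slot]][offsets[slot]]
```

Note that children 0 and 1 can both be int32 here, since slots refer to a
child by type id rather than by type, which is what option 2 needs.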

I'm still ramping up on the Java code base, so I'd like other Java
contributors to chime in on whether this plan sounds feasible and acceptable.

Any other thoughts on Unions?

Thanks,
Micah

[1]
https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E
[2] https://github.com/apache/arrow/pull/987#issuecomment-493231493
[3] https://github.com/apache/arrow/pull/987#issuecomment-493231493
[4] https://developers.google.com/protocol-buffers/docs/proto#oneof
