Jumping in b/c I did the JS Union implementations. I inferred the behavior from what I understood the C++ and Java to be doing, so I may have misunderstood how they should work.

> To that end, we talked about
> introducing a "single-primitive" (a.k.a. "javascript") union behavior that
> would operate this way.

Just to clarify, Jacques: are you referencing how the ArrowJS Unions work today, or using JavaScript as an adjective to describe the behavior you'd like to see? If the former, I may have misunderstood the distinction between Dense and Sparse Unions (the typeIds buffer maps idx -> child_id, with Dense including a valueOffsets buffer to also map idx -> child_idx). I'm happy to review the implementations if this behavior is incorrect.

> It would be defined by only allowing one of each
> variety of type at any intermediate node of hierarchy. In other words, a
> struct could never contain two structs or two lists. (It also couldn't
> contain two int64 or int32). This is how the Java library behaves.

One way we use the JS Union implementation at Graphistry is representing a heterogeneous Struct of IPv4/6 address + port number combinations:

> interface IPv4 extends BinaryVector { metadata: { ipVersion: 4 } }
> interface IPv6 extends BinaryVector { metadata: { ipVersion: 6 } }
>
> type IPAddresses = DenseUnion<IPv4 | IPv6>
> type IPsAndPorts = Struct<[IPAddresses, Int32 /* <- nullable port vector */]>

In this case, we benefit from the ability to compact the IP addresses into dense Binary Vectors, with DenseUnion's valueOffsets buffer acting as an implicit Dictionary encoding -- useful when representing 200k events on an internal network of, say, ~200 IPs.

Would the "single-primitive" proposal restrict the IPAddresses type from containing two child Binary Vectors?

> On Mar 20, 2018, at 10:05 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>
>>
>> I may have missed something, but I'm not remembering either the points
>> re: JavaScript or decimals. My understanding is that we have been
>> discussing how to handle a union-of-complex-types -- the Union
>> implementation in Java does not support this.
>> Could you clarify or
>> refer to prior mailing list threads?
>>
>
> Sorry, let me clarify.
>
> The original thinking was that there is a non-collapsing intermediate node
> behavior and an intermediate node collapsing behavior (a.k.a.
> single-primitive behavior) for unions. For example, if we have the
> following records and types (imagine two different sensor generations):
>
> sensor_gen1: {
>   ts: <timestamp(nanos)>,
>   info: {
>     metric: <utf8>,
>     value: <double>,
>     variance: <double>
>   }
> }
>
> (a.k.a. struct<
>   ts:timestamp(nanos),
>   info: struct<
>     metric: utf8,
>     value: double,
>     variance: double
>   >
> >)
>
> sensor_gen2: {
>   ts: <timestamp(nanos)>,
>   info: {
>     metric: <utf8>,
>     value: <int64>,
>     tolerance: <double>
>   }
> }
>
> (a.k.a. struct<
>   ts:timestamp(nanos),
>   info: struct<
>     metric: utf8,
>     value: int64,
>     tolerance: double
>   >
> >)
>
>
> We have two possible unions that could be created:
>
> the non-node-collapsing behavior:
> struct<
>   ts:timestamp(nanos),
>   info: union<
>     struct<
>       metric: utf8,
>       value: double,
>       variance: double
>     >,
>     struct<
>       metric: utf8,
>       value: int64,
>       tolerance: double
>     >
>   >
> >
>
> Or the collapsing behavior
>
> struct<
>   ts:timestamp(nanos),
>   info: union<
>     info: struct<
>       metric: utf8,
>       value: union<double, int64>,
>       tolerance: double,
>       variance: double
>     >
>   >
> >
>
> For generalized data processing (e.g. a sql system), I consider the latter
> to be optimal as it allows analysts to deal with sameness without having to
> dereference to a particular union branch. To that end, we talked about
> introducing a "single-primitive" (a.k.a. "javascript") union behavior that
> would operate this way. It would be defined by only allowing one of each
> variety of type at any intermediate node of hierarchy. In other words, a
> struct could never contain two structs or two lists. (It also couldn't
> contain two int64 or int32). This is how the Java library behaves.
> The format simplification that is then possible would be that these names
> would be directly mapped to known positions (e.g. struct is always in
> position 1 and list is always in position 2, etc.). The Java library doesn't
> try to do the latter at the moment (it used to but the definition wasn't
> clear).
>
> The single-primitive behavior in general works very well. It also doesn't
> limit a user from having a set of multiple unions that they want to
> dereference but does require that each of those branches is named via a
> struct rather than using positions in unions. In other words, it doesn't
> allow for positional union dereferencing. The one place where it becomes
> challenging is when a leaf node is not simple. For example decimal(30,2)
> combined with decimal(30,4). In this case, what should the behavior be?
> Following a single-primitive model would suggest that this is only possible
> if you named them e.g. struct<dec30_2: decimal(30,2), dec30_4:
> decimal(30,4)> but that seems arbitrary since I can also create
> union<int32,int64> (which feels very much the same). The problem compounds
> as we have added more information at other leaf types (e.g.
> timestamp(millis) and timestamp(nanos)).
>
> So, my suggestion that started the thread was that this single-primitive
> behavior not be part of the format but be a choice of the implementation.
> In terms of the way to expose the union of structs scenario in Java, I
> propose that we implement that as named structs for now and enhance the
> behavior if people have use cases that need alternative apis (and are
> willing to invest in an arbitrary approach without disrupting the existing
> apis).
>
>
>>> - Interval Day to Seconds: 8 bytes representing number of
>>> milliseconds.
>>> - Interval Year to Months: 4 bytes representing number of months.
>>
>> Yes, I'm supportive of this.
>> The one addition is that we need to add a
>> "unit" field to the metadata to support finer granularity than
>> milliseconds -- the idea is that we should support the same units as
>> Timestamp so that a difference of timestamps produces an interval (aka
>> timedelta). We have this data arising already in Python, for example,
>> but we cannot represent it in Arrow at the moment, so this has been a
>> rough edge for users.
>>
>>
> Agree on units.
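
For reference, the Dense vs. Sparse lookup described at the top of this message (typeIds maps idx -> child_id; Dense adds valueOffsets to map idx -> child_idx) can be sketched roughly as below. This is a simplified, hypothetical model and not the ArrowJS API: the SparseUnion/DenseUnion interfaces, getSparse, getDense, decodeDense, and the string-valued children are all invented for illustration (the real children would be Binary Vectors). The repeated offsets mirror the "implicit Dictionary encoding" use case from the message; whether the format permits offsets to repeat is an assumption this sketch does not settle.

```typescript
// Hypothetical, simplified model of union lookup; not the ArrowJS API.
type Child = ReadonlyArray<unknown>;

interface SparseUnion {
  typeIds: Int8Array; // idx -> child_id
  children: Child[];  // in the sparse case every child is as long as typeIds
}

interface DenseUnion extends SparseUnion {
  valueOffsets: Int32Array; // idx -> child_idx (children may be shorter)
}

// Sparse: the selected child is indexed by the same idx as the union.
function getSparse(u: SparseUnion, idx: number): unknown {
  return u.children[u.typeIds[idx]][idx];
}

// Dense: the selected child is indexed through valueOffsets.
function getDense(u: DenseUnion, idx: number): unknown {
  return u.children[u.typeIds[idx]][u.valueOffsets[idx]];
}

// Materialize every slot of a dense union into a plain array.
function decodeDense(u: DenseUnion): unknown[] {
  const out: unknown[] = [];
  for (let idx = 0; idx < u.typeIds.length; idx++) {
    out.push(getDense(u, idx));
  }
  return out;
}

// Five events drawn from only three distinct addresses (strings stand in
// for binary IP values). Child 0 plays the role of IPv4, child 1 of IPv6.
const dense: DenseUnion = {
  typeIds: Int8Array.from([0, 0, 1, 0, 1]),
  valueOffsets: Int32Array.from([0, 1, 0, 0, 0]), // offsets repeat (assumption)
  children: [["10.0.0.1", "10.0.0.2"], ["fe80::1"]],
};

// The same five events as a sparse union: each child carries one slot per
// event, with null in the slots that belong to the other child.
const sparse: SparseUnion = {
  typeIds: Int8Array.from([0, 0, 1, 0, 1]),
  children: [
    ["10.0.0.1", "10.0.0.2", null, "10.0.0.1", null],
    [null, null, "fe80::1", null, "fe80::1"],
  ],
};

const denseEvents = decodeDense(dense);
```

In this toy layout the dense children hold three values for five events, while the sparse children hold ten slots; that gap is the compaction the IPAddresses example relies on.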