Thanks Antoine,

Can you just clarify what you mean by 'type ids are logical'? In my mind type ids are strongly coupled to the types and their order in Schema.fbs [1]. Do you mean that the order there is only a convention and we can't assume that 0 === Null?
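To make the question concrete, here is a minimal sketch in plain Python of the reading in which type ids are logical. It is not the Arrow API: the class name, field names, and the particular ids (0 and 5) are made up for illustration. Under this reading a type id is only a key that selects a child, so 0 is just another key and says nothing about whether the slot is null.

```python
# Hypothetical illustration only: plain Python, not the Arrow library API.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SparseUnionSketch:
    # Maps a logical type id (as stored per slot) to a child index.
    type_id_to_child: Dict[int, int]
    # One type id per slot.
    type_ids: List[int]
    # Sparse layout: every child has the same length as the union itself.
    children: List[List[Any]]

    def value(self, i: int) -> Any:
        # Look up which child the slot's type id selects, then read slot i.
        child = self.children[self.type_id_to_child[self.type_ids[i]]]
        return child[i]


# Assumed ids: type id 0 selects the int child, type id 5 the string child.
union = SparseUnionSketch(
    type_id_to_child={0: 0, 5: 1},
    type_ids=[0, 5, 0],
    children=[[10, None, 30], [None, "b", None]],
)

values = [union.value(i) for i in range(3)]
print(values)
```

If that is the right model, slots 0 and 2 carry type id 0 and still hold real values, so a type id of 0 could not stand in for a validity buffer.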
Best,
Ryan

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L235

On Tue, May 19, 2020 at 2:04 PM Antoine Pitrou <anto...@python.org> wrote:

>
> On 19/05/2020 at 13:43, Ryan Murray wrote:
> > Hey All,
> >
> > While working on https://issues.apache.org/jira/browse/ARROW-1692 I noticed
> > that there is a difference between C++ and Java in the way Sparse Unions
> > are handled. I haven't seen in the format spec which one is correct, so I
> > wanted to check with the wider community.
> >
> > c++ (and the integration tests) see sparse unions as:
> > name
> > count
> > VALIDITY[]
> > TYPE_ID[]
> > children[]
> >
> > and java as:
> > name
> > count
> > TYPE[]
> > children[]
> >
> > The precise names may only be important for json reading/writing in the
> > integration tests so I will ignore TYPE/TYPE_ID for now. However, the big
> > difference is that Java doesn't have a validity buffer and c++ does. My
> > understanding is that technically the validity buffer is redundant (0 type
> > == NULL) so I can see why Java would omit it. My question is then: which
> > language is 'correct'?
>
> Union type ids are logical, so 0 could very well be a valid type id.
> You can't assume that type 0 means a null entry.
>
> Regards
>
> Antoine.
>