Makes sense

On Sat, Feb 20, 2016 at 11:56 AM, Wes McKinney <w...@cloudera.com> wrote:
> My expectation would be that data without nulls (as with required
> types) would typically not have the null bitmap allocated at all, but
> this would be implementation dependent. For example, in builder
> classes, the first time a null is appended, the null bitmap could be
> allocated.
>
> In an IPC / wire protocol context, there would be no reason to send
> extra bits when the null count is 0 -- the data receiver, based on
> their implementation, could decide whether or not to allocate a
> bitmap based on that information. Since the data structures are
> intended as immutable, there is no specific need to create an all-0
> bitmap.
>
> On Sat, Feb 20, 2016 at 11:52 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> > We actually started there (and in fact Drill has lived there for
> > the last three years). However, more and more, other members of
> > that team and I have come to the conclusion that the additional
> > complexity isn't worth the extra level of code complication. By
> > providing the null count we can achieve the same level of
> > efficiency (+/- carrying around an extra bitmap, which is pretty
> > nominal in the grand scheme of things).
> >
> > Another thought could be exposing nullability as a physical
> > property and not having it be part of the logical model. That being
> > said, I don't think it is worth the headache.
> >
> > On Sat, Feb 20, 2016 at 11:43 AM, Daniel Robinson <danrobinson...@gmail.com> wrote:
> >> Hi all,
> >>
> >> I like this proposal (as well as the rest of the spec so far!).
> >> But why not go further and just store arrays that are nullable
> >> according to the schema, but have no nulls in them, as
> >> "non-nullable" data structures -- i.e. structures that have no
> >> null bitmask? (After all, it would obviously be a waste to
> >> allocate a null bitmask for arrays with null_count = 0.) There
> >> would then be two types at the data structure level, and two
> >> implementations of every algorithm, one for each of those types.
> >>
> >> If you do that, I'm not sure I see a reason for keeping track of
> >> null_count. Is there ever an efficiency gain from having that
> >> stored with an array? Algorithms that might introduce or remove
> >> nulls could just keep track of their own "null_count" that
> >> increments up from 0, and create a no-nulls data structure if they
> >> never find one.
> >>
> >> I think this might also simplify the system interchange validation
> >> problem, since a system could just check the data-structure-level
> >> type of the input. (Although I'm not sure I understand why that
> >> would be necessary at "runtime.")
> >>
> >> Perhaps you should have different names for the
> >> data-structure-level types to distinguish them from the "nullable"
> >> and "non-nullable" types at the schema level. (And also for
> >> philosophical reasons -- since the arrays are immutable,
> >> "nullable" doesn't really have meaning at that level, does it?)
> >> "some_null" and "no_null"? Maybe "sparse" and "dense," although
> >> that too has a different meaning elsewhere in the spec...
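Wes's lazy-allocation point at the top of this thread is easy to picture
in code. Below is a minimal C++ sketch, not Arrow's actual builder API:
the builder class, its method names, and the use of std::vector<bool> as
a stand-in for a real bitmap are all hypothetical. The validity bitmap is
only materialized when the first null is appended, so an array that never
sees a null carries no bitmap at all.

    #include <cstdint>
    #include <optional>
    #include <vector>

    // Hypothetical builder, for illustration only. The validity bitmap
    // is absent until the first null arrives; until then the array is
    // physically identical to required/non-nullable data.
    class Int32Builder {
     public:
      void Append(int32_t value) {
        values_.push_back(value);
        if (validity_) validity_->push_back(true);
      }

      void AppendNull() {
        if (!validity_) {
          // First null: allocate the bitmap and back-fill "valid" for
          // everything appended so far.
          validity_.emplace(values_.size(), true);
        }
        values_.push_back(0);  // placeholder slot for the null
        validity_->push_back(false);
        ++null_count_;
      }

      int64_t null_count() const { return null_count_; }
      bool has_validity_bitmap() const { return validity_.has_value(); }

     private:
      std::vector<int32_t> values_;
      std::optional<std::vector<bool>> validity_;  // allocated lazily
      int64_t null_count_ = 0;
    };

The same rule works in reverse on the wire: a sender with null_count == 0
can omit the bitmap buffer entirely, and the receiver decides for itself
whether to materialize one.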
> >>
> >> On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <w...@cloudera.com> wrote:
> >> > hi folks,
> >> >
> >> > welcome to all! It's great to see so many people excited about
> >> > our plans to make data systems faster and more interoperable.
> >> >
> >> > In thinking about building some initial Arrow integrations, I've
> >> > run into a couple of interrelated format questions.
> >> >
> >> > The first is a proposal to add a null count to Arrow arrays.
> >> > With optional/nullable data, null_count == 0 will allow
> >> > algorithms to skip the null-handling code paths and treat the
> >> > data as required/non-nullable, yielding performance benefits.
> >> > For example:
> >> >
> >> > if (arr->null_count() == 0) {
> >> >   ...
> >> > } else {
> >> >   ...
> >> > }
> >> >
> >> > Relatedly, at the data structure level, there is little semantic
> >> > distinction between these two cases:
> >> >
> >> > - Required / Non-nullable arrays
> >> > - Optional arrays with null count 0
> >> >
> >> > My thought is that "required-ness" is best handled at the
> >> > metadata / schema level, rather than tasking the lowest tier of
> >> > data structures and algorithms with handling two semantically
> >> > distinct, but functionally equivalent, forms of data without
> >> > nulls. When performing analytics, it adds complexity, as some
> >> > operations may introduce or remove nulls, which would require
> >> > type metadata to be massaged, i.e.:
> >> >
> >> > function(required input) -> optional output, versus
> >> >
> >> > function(input [null_count == 0]) -> output [maybe null_count > 0]
> >> >
> >> > In the latter case, algorithms set bits and track the number of
> >> > nulls while constructing the output Arrow array; the former adds
> >> > some extra complexity.
> >> >
> >> > The question, of course, is where to enforce "required" in data
> >> > interchange. If two systems have agreed (through an exchange of
> >> > schemas/metadata) that a particular batch of Arrow data is
> >> > non-nullable, I would suggest that the null_count == 0 contract
> >> > be validated at that point.
> >> >
> >> > Curious to hear others' thoughts on this, and please let me know
> >> > if I can clarify anything I've said here.
> >> >
> >> > best,
> >> > Wes
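To make the proposed fast path concrete, here is a sketch of the branch
in Wes's example, fleshed out into a sum kernel. Int32Array is a
hypothetical stand-in for an Arrow array (again with std::vector<bool>
in place of a real bitmap); only the null_count == 0 dispatch itself is
taken from the proposal.

    #include <cstdint>
    #include <vector>

    // Hypothetical stand-in for an Arrow-style array, for illustration
    // only.
    struct Int32Array {
      std::vector<int32_t> values;
      std::vector<bool> validity;  // left empty when null_count == 0
      int64_t null_count = 0;

      int64_t length() const { return static_cast<int64_t>(values.size()); }
      bool IsValid(int64_t i) const { return validity[i]; }
    };

    // The branch from the example above: when null_count == 0, optional
    // data is treated exactly like required data and the bitmap is
    // never touched.
    int64_t Sum(const Int32Array& arr) {
      int64_t total = 0;
      if (arr.null_count == 0) {
        for (int64_t i = 0; i < arr.length(); ++i) total += arr.values[i];
      } else {
        for (int64_t i = 0; i < arr.length(); ++i) {
          if (arr.IsValid(i)) total += arr.values[i];
        }
      }
      return total;
    }

Both branches compute the same result; the appeal of storing null_count
on the array is that this dispatch costs one integer comparison per
kernel invocation, rather than threading a required/optional type split
through every algorithm.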