Re: Format: storing null count + required/non-nullable types

Jacques Nadeau Sat, 20 Feb 2016 11:53:29 -0800

We actually started there (and in fact Drill existed there for the last
three years). However, more and more, other members of that team and I
have come to the conclusion that the distinction isn't worth the extra
code complexity. By providing the null count we can achieve the same
level of efficiency (+/- carrying around an extra bitmap, which is pretty
nominal in the grand scheme of things).
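
A minimal sketch of the null-count fast path, to make the point concrete.
This is not code from Drill or Arrow: the Int32Column type, its fields, and
the sum function are hypothetical, and the bitmap convention (one bit per
value, least-significant bit first, set bit = not null) is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical column type for illustration only; not the Arrow or Drill API.
struct Int32Column {
  std::vector<int32_t> values;
  // Assumed layout: one bit per value, LSB first, set bit means "not null".
  // May be left empty when null_count == 0, since it is never read then.
  std::vector<uint8_t> validity;
  int64_t null_count;

  bool is_valid(std::size_t i) const {
    return ((validity[i / 8] >> (i % 8)) & 1) != 0;
  }
};

// Sums the non-null values. A single implementation serves both cases;
// null_count selects the code path at runtime.
int64_t sum(const Int32Column& col) {
  int64_t total = 0;
  if (col.null_count == 0) {
    // Fast path: the bitmap is never consulted.
    for (int32_t v : col.values) total += v;
  } else {
    // Slow path: skip entries whose validity bit is unset.
    for (std::size_t i = 0; i < col.values.size(); ++i) {
      if (col.is_valid(i)) total += col.values[i];
    }
  }
  return total;
}
```

With this shape, the only cost relative to a dedicated non-nullable type is
the unused bitmap, which matches the "+/-" caveat above.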


Another thought could be exposing nullability as a physical property and
not having it be part of the logical model. That being said, I don't think
it is worth the headache.

On Sat, Feb 20, 2016 at 11:43 AM, Daniel Robinson <[email protected]>
wrote:

> Hi all,
>
> I like this proposal (as well as the rest of the spec so far!).  But why
> not go further and just store arrays that are nullable according to the
> schema but have no nulls in them as "non-nullable" data structures—i.e.
> structures that have no null bitmask? (After all, it would obviously be a
> waste to allocate a null bitmask for arrays with null_count = 0.) So there
> will be two types on the data structure level, and two implementations of
> every algorithm, one for each of those types.
>
> If you do that, I'm not sure I see a reason for keeping track of
> null_count. Is there ever an efficiency gain from having that stored with
> an array? Algorithms that might introduce or remove nulls could just keep
> track of their own "null_count" that increments up from 0, and create a
> no-nulls data structure if they never find one.
>
> I think this might also simplify the system interchange validation problem,
> since a system could just check the data-structure-level type of the input.
> (Although I'm not sure I understand why that would be necessary at
> "runtime.")
>
> Perhaps you should have different names for the data-structure-level types
> to distinguish them from the "nullable" and "non-nullable" types at the
> schema level. (And also for philosophical reasons—since the arrays are
> immutable, "nullable" doesn't really have meaning on that level, does it?).
> "some_null" and "no_null"?  Maybe "sparse" and "dense," although that too
> has a different meaning elsewhere in the spec...
>
>
>
> On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <[email protected]> wrote:
>
> > hi folks,
> >
> > welcome to all! It's great to see so many people excited about our
> > plans to make data systems faster and more interoperable.
> >
> > In thinking about building some initial Arrow integrations, I've run
> > into a couple of inter-related format questions.
> >
> > The first is a proposal to add a null count to Arrow arrays. With
> > optional/nullable data, null_count == 0 will allow algorithms to skip
> > the null-handling code paths and treat the data as
> > required/non-nullable, yielding performance benefits. For example:
> >
> > if (arr->null_count() == 0) {
> >   ...
> > } else {
> >   ...
> > }
> >
> > Relatedly, at the data structure level, there is little semantic
> > distinction between these two cases
> >
> > - Required / Non-nullable arrays
> > - Optional arrays with null count 0
> >
> > My thoughts are that "required-ness" would best be minded at the
> > metadata / schema level, rather than tasking the lowest tier of data
> > structures and algorithms with handling the two semantically distinct,
> > but functionally equivalent forms of data without nulls. When
> > performing analytics, it adds complexity as some operations may
> > introduce or remove nulls, which would require type metadata to be
> > massaged i.e.:
> >
> > function(required input) -> optional output, versus
> >
> > function(input [null_count == 0]) -> output [maybe null_count > 0].
> >
> > In the latter case, algorithms set bits and track the number of nulls
> > while constructing the output Arrow array; the former adds some extra
> > complexity.
> >
> > The question, of course, is where to enforce "required" in data
> > interchange. If two systems have agreed (through exchange of
> > schemas/metadata) that a particular batch of Arrow data is
> > non-nullable, I would suggest that the null_count == 0 contract be
> > validated at that point.
> >
> > Curious to hear others' thoughts on this, and please let me know if I
> > can clarify anything I've said here.
> >
> > best,
> > Wes
> >
>
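
The interchange check proposed in the original message (validate
null_count == 0 once two systems have agreed through their schemas that a
field is required) could be sketched as follows. FieldSchema, ArrayInfo, and
validate_required are hypothetical names used only for illustration; nothing
here is taken from the Arrow spec or any implementation.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical schema-level metadata: "required" lives here, not in the array.
struct FieldSchema {
  std::string name;
  bool nullable;
};

// Hypothetical array-level metadata carried with each batch.
struct ArrayInfo {
  int64_t length;
  int64_t null_count;
};

// Enforces the "required" contract at the interchange boundary only.
// Throws std::invalid_argument if a required field arrives with nulls.
void validate_required(const FieldSchema& field, const ArrayInfo& array) {
  if (!field.nullable && array.null_count != 0) {
    throw std::invalid_argument(
        "field '" + field.name + "' is required but has " +
        std::to_string(array.null_count) + " nulls");
  }
}
```

The check reads one integer per array, so it stays cheap to run on every
received batch.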
