Makes sense

On Sat, Feb 20, 2016 at 11:56 AM, Wes McKinney <w...@cloudera.com> wrote:

> My expectation would be that data without nulls (as with required
> types) would typically not have the null bitmap allocated at all, but
> this would be implementation-dependent. For example, in builder
> classes, the first time a null is appended, the null bitmap could be
> allocated.
>
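> A minimal sketch of that lazy-allocation idea (a hypothetical builder,
> not the actual Arrow API, and it assumes a validity convention where a
> set bit means "value present"):
>
> #include <cstddef>
> #include <cstdint>
> #include <vector>
>
> class Int32Builder {
>  public:
>   void Append(int32_t value) {
>     values_.push_back(value);
>     // Only maintain validity bits once a bitmap exists.
>     if (!null_bitmap_.empty()) SetBit(values_.size() - 1, true);
>   }
>
>   void AppendNull() {
>     if (null_bitmap_.empty()) {
>       // First null seen: allocate the bitmap and mark all prior
>       // slots valid.
>       null_bitmap_.assign(values_.size() / 8 + 1, 0xFF);
>     }
>     values_.push_back(0);  // placeholder slot for the null value
>     SetBit(values_.size() - 1, false);
>     ++null_count_;
>   }
>
>   int64_t null_count() const { return null_count_; }
>
>  private:
>   void SetBit(std::size_t i, bool valid) {
>     if (i / 8 >= null_bitmap_.size()) null_bitmap_.push_back(0xFF);
>     if (valid) null_bitmap_[i / 8] |= (1 << (i % 8));
>     else       null_bitmap_[i / 8] &= ~(1 << (i % 8));
>   }
>
>   std::vector<int32_t> values_;
>   std::vector<uint8_t> null_bitmap_;  // stays empty until first null
>   int64_t null_count_ = 0;
> };
>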
> In an IPC / wire protocol context, there would be no reason to send
> extra bits when the null count is 0 -- the data receiver, based on
> their implementation, could decide whether or not to allocate a bitmap
> based on that information. Since the data structures are intended to
> be immutable, there is no specific need to create an all-0 bitmap.
>
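> As an illustration only (the actual wire format is still being worked
> out; names here are made up), the receive path could look like:
>
> #include <cstdint>
>
> struct ArrayHeader {
>   int64_t length;
>   int64_t null_count;  // carried in the message metadata
> };
>
> // Returns the bitmap pointer, or nullptr when nothing was sent.
> const uint8_t* ReadNullBitmap(const ArrayHeader& header,
>                               const uint8_t* wire_data) {
>   if (header.null_count == 0) {
>     return nullptr;  // no bits on the wire; all values are valid
>   }
>   // Otherwise the first (length + 7) / 8 bytes hold the bitmap.
>   return wire_data;
> }
>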
> On Sat, Feb 20, 2016 at 11:52 AM, Jacques Nadeau <jacq...@apache.org>
> wrote:
> > We actually started there (and in fact Drill has worked that way for
> > the last three years). However, more and more, other members of that
> > team and I have come to the conclusion that the approach isn't worth
> > the extra level of code complication. By providing the null count we
> > can achieve the same level of efficiency (+/- carrying around an
> > extra bitmap, which is pretty nominal in the grand scheme of things).
> >
> > Another thought could be exposing nullability as a physical property
> > and not having it be part of the logical model. That being said, I
> > don't think it is worth the headache.
> >
> > On Sat, Feb 20, 2016 at 11:43 AM, Daniel Robinson
> > <danrobinson...@gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> I like this proposal (as well as the rest of the spec so far!). But
> >> why not go further and just store arrays that are nullable according
> >> to the schema but have no nulls in them as "non-nullable" data
> >> structures, i.e. structures that have no null bitmask? (After all, it
> >> would obviously be a waste to allocate a null bitmask for arrays with
> >> null_count = 0.) So there will be two types at the data-structure
> >> level, and two implementations of every algorithm, one for each of
> >> those types.
> >>
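> >> To make the "two implementations" idea concrete, here is a sketch
> >> (hypothetical names, and it assumes a validity bitmap where a set
> >> bit means "value present"):
> >>
> >> #include <cstdint>
> >>
> >> int64_t SumNoNulls(const int32_t* values, int64_t n) {
> >>   int64_t sum = 0;
> >>   for (int64_t i = 0; i < n; ++i) sum += values[i];
> >>   return sum;
> >> }
> >>
> >> int64_t SumWithNulls(const int32_t* values, const uint8_t* bitmap,
> >>                      int64_t n) {
> >>   int64_t sum = 0;
> >>   for (int64_t i = 0; i < n; ++i) {
> >>     // Skip slots whose validity bit is not set.
> >>     if (bitmap[i / 8] & (1 << (i % 8))) sum += values[i];
> >>   }
> >>   return sum;
> >> }
> >>
> >> // Dispatch once per array on the data-structure-level "type",
> >> // i.e. on whether a bitmap exists at all.
> >> int64_t Sum(const int32_t* values, const uint8_t* bitmap, int64_t n) {
> >>   return bitmap ? SumWithNulls(values, bitmap, n)
> >>                 : SumNoNulls(values, n);
> >> }
> >>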
> >> If you do that, I'm not sure I see a reason for keeping track of
> >> null_count. Is there ever an efficiency gain from having that stored
> >> with an array? Algorithms that might introduce or remove nulls could
> >> just keep track of their own "null_count" that increments up from 0,
> >> and create a no-nulls data structure if they never find one.
> >>
> >> I think this might also simplify the system interchange validation
> >> problem, since a system could just check the data-structure-level
> >> type of the input. (Although I'm not sure I understand why that
> >> would be necessary at "runtime.")
> >>
> >> Perhaps you should have different names for the data-structure-level
> >> types to distinguish them from the "nullable" and "non-nullable"
> >> types at the schema level. (And also for philosophical reasons:
> >> since the arrays are immutable, "nullable" doesn't really have
> >> meaning at that level, does it?) "some_null" and "no_null"? Maybe
> >> "sparse" and "dense," although that too has a different meaning
> >> elsewhere in the spec...
> >>
> >>
> >>
> >> On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <w...@cloudera.com>
> >> wrote:
> >>
> >> > hi folks,
> >> >
> >> > welcome to all! It's great to see so many people excited about our
> >> > plans to make data systems faster and more interoperable.
> >> >
> >> > In thinking about building some initial Arrow integrations, I've run
> >> > into a couple of inter-related format questions.
> >> >
> >> > The first is a proposal to add a null count to Arrow arrays. With
> >> > optional/nullable data, null_count == 0 will allow algorithms to skip
> >> > the null-handling code paths and treat the data as
> >> > required/non-nullable, yielding performance benefits. For example:
> >> >
> >> > if (arr->null_count() == 0) {
> >> >   ...  // fast path: every value is valid; skip bitmap checks
> >> > } else {
> >> >   ...  // slow path: consult the null bitmap for each value
> >> > }
> >> >
> >> > Relatedly, at the data structure level, there is little semantic
> >> > distinction between these two cases:
> >> >
> >> > - Required / Non-nullable arrays
> >> > - Optional arrays with null count 0
> >> >
> >> > My thoughts are that "required-ness" would best be handled at the
> >> > metadata / schema level, rather than tasking the lowest tier of data
> >> > structures and algorithms with handling the two semantically
> >> > distinct but functionally equivalent forms of data without nulls.
> >> > When performing analytics, it adds complexity, as some operations
> >> > may introduce or remove nulls, which would require type metadata to
> >> > be massaged, i.e.:
> >> >
> >> > function(required input) -> optional output, versus
> >> >
> >> > function(input [null_count == 0]) -> output [maybe null_count > 0].
> >> >
> >> > In the latter case, algorithms set bits and track the number of nulls
> >> > while constructing the output Arrow array; the former adds some extra
> >> > complexity.
> >> >
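> >> > For instance, a hypothetical kernel that can introduce nulls
> >> > (assuming inputs with null_count == 0 and a validity bitmap where
> >> > a set bit means "value present"), with equal-length inputs:
> >> >
> >> > #include <cstdint>
> >> > #include <vector>
> >> >
> >> > struct Int32Array {
> >> >   std::vector<int32_t> values;
> >> >   std::vector<uint8_t> null_bitmap;
> >> >   int64_t null_count = 0;
> >> > };
> >> >
> >> > // Division that nulls out zero divisors; the output null_count is
> >> > // tracked as bits are set, with no type-level rewrite required.
> >> > Int32Array Divide(const Int32Array& num, const Int32Array& den) {
> >> >   Int32Array out;
> >> >   const int64_t n = num.values.size();
> >> >   out.values.resize(n, 0);
> >> >   out.null_bitmap.assign((n + 7) / 8, 0);
> >> >   for (int64_t i = 0; i < n; ++i) {
> >> >     if (den.values[i] != 0) {
> >> >       out.values[i] = num.values[i] / den.values[i];
> >> >       out.null_bitmap[i / 8] |= (1 << (i % 8));  // mark valid
> >> >     } else {
> >> >       ++out.null_count;  // bit stays 0: a null was introduced
> >> >     }
> >> >   }
> >> >   return out;
> >> > }
> >> >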
> >> > The question, of course, is where to enforce "required" in data
> >> > interchange. If two systems have agreed (through exchange of
> >> > schemas/metadata) that a particular batch of Arrow data is
> >> > non-nullable, I would suggest that the null_count == 0 contract be
> >> > validated at that point.
> >> >
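> >> > Something along these lines (a sketch only; the names are made up):
> >> >
> >> > #include <cstdint>
> >> > #include <stdexcept>
> >> > #include <string>
> >> >
> >> > // Enforce the agreed-upon contract once, at the interchange
> >> > // boundary, instead of in every downstream algorithm.
> >> > void ValidateRequired(int64_t null_count, const std::string& field) {
> >> >   if (null_count != 0) {
> >> >     throw std::runtime_error(
> >> >         "field '" + field + "' declared required but has nulls");
> >> >   }
> >> > }
> >> >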
> >> > Curious to hear others' thoughts on this, and please let me know if I
> >> > can clarify anything I've said here.
> >> >
> >> > best,
> >> > Wes
> >> >
> >>
>
