Makes sense, in that this can be implementation-specific.

For a little more background: we've found that while you may implement many
algorithms as columnar, you probably won't do everything that way. As such,
it can be easier to avoid the column-level property: working on individual
records is simpler if you know the bitmap always exists, which is why I'm
inclined to maintain it in the Java impl -- especially given that a stream
of record batches can have null_count = 0 for some batches but not all.

On Sat, Feb 20, 2016 at 11:56 AM, Wes McKinney <w...@cloudera.com> wrote:

> My expectation would be that data without nulls (as with required
> types) would typically not have the null bitmap allocated at all, but
> this would be implementation-dependent. For example, in builder
> classes, the null bitmap could be allocated the first time a null is
> appended.
>
> In an IPC / wire protocol context, there would be no reason to send
> extra bits when the null count is 0 -- the data receiver, based on
> their implementation, could decide whether or not to allocate a bitmap
> based on that information. Since the data structures are intended to
> be immutable, there is no specific need to create an all-0 bitmap.
>
> On Sat, Feb 20, 2016 at 11:52 AM, Jacques Nadeau <jacq...@apache.org>
> wrote:
> > We actually started there (and in fact Drill has worked that way for
> > the last three years). However, more and more, other members of that
> > team and I have come to the conclusion that the additional complexity
> > isn't worth the extra level of code complication. By providing the
> > null count we can achieve the same level of efficiency (+/- carrying
> > around an extra bitmap, which is pretty nominal in the grand scheme
> > of things).
> >
> > Another thought could be exposing nullability as a physical property
> > and not have it be part of the logical model. That being said, I
> > don't think it is worth the headache.
> >
> > On Sat, Feb 20, 2016 at 11:43 AM, Daniel Robinson
> > <danrobinson...@gmail.com> wrote:
> >
> >
> >> Hi all,
> >>
> >> I like this proposal (as well as the rest of the spec so far!).  But why
> >> not go further and just store arrays that are nullable according to the
> >> schema but have no nulls in them as "non-nullable" data structures—i.e.
> >> structures that have no null bitmask? (After all, it would obviously be
> a
> >> waste to allocate a null bitmask for arrays with null_count = 0.) So
> there
> >> will be two types on the data structure level, and two implementations
> of
> >> every algorithm, one for each of those types.
> >>
> >> If you do that, I'm not sure I see a reason for keeping track of
> >> null_count. Is there ever an efficiency gain from having that stored
> >> with an array? Algorithms that might introduce or remove nulls could
> >> just keep track of their own "null_count" that increments up from 0,
> >> and create a no-nulls data structure if they never find one.
> >>
> >> I think this might also simplify the system interchange validation
> >> problem, since a system could just check the data-structure-level
> >> type of the input. (Although I'm not sure I understand why that
> >> would be necessary at "runtime.")
> >>
> >> Perhaps you should have different names for the data-structure-level
> >> types to distinguish them from the "nullable" and "non-nullable"
> >> types at the schema level. (And also for philosophical reasons --
> >> since the arrays are immutable, "nullable" doesn't really have
> >> meaning at that level, does it?) "some_null" and "no_null"? Maybe
> >> "sparse" and "dense," although that too has a different meaning
> >> elsewhere in the spec...
> >>
> >>
> >>
> >> On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <w...@cloudera.com>
> >> wrote:
> >>
> >> > hi folks,
> >> >
> >> > welcome to all! It's great to see so many people excited about our
> >> > plans to make data systems faster and more interoperable.
> >> >
> >> > In thinking about building some initial Arrow integrations, I've run
> >> > into a couple of inter-related format questions.
> >> >
> >> > The first is a proposal to add a null count to Arrow arrays. With
> >> > optional/nullable data, null_count == 0 will allow algorithms to skip
> >> > the null-handling code paths and treat the data as
> >> > required/non-nullable, yielding performance benefits. For example:
> >> >
> >> > if (arr->null_count() == 0) {
> >> >   // fast path: no nulls, skip validity checks entirely
> >> >   ...
> >> > } else {
> >> >   // slow path: consult the validity bitmap per value
> >> >   ...
> >> > }
> >> >
> >> > Relatedly, at the data structure level, there is little semantic
> >> > distinction between these two cases:
> >> >
> >> > - Required / Non-nullable arrays
> >> > - Optional arrays with null count 0
> >> >
> >> > My thoughts are that "required-ness" would best be handled at the
> >> > metadata / schema level, rather than tasking the lowest tier of
> >> > data structures and algorithms with handling two semantically
> >> > distinct, but functionally equivalent, forms of data without
> >> > nulls. When performing analytics, it adds complexity: some
> >> > operations may introduce or remove nulls, which would require type
> >> > metadata to be massaged, i.e.:
> >> >
> >> > function(required input) -> optional output, versus
> >> >
> >> > function(input [null_count == 0]) -> output [maybe null_count > 0].
> >> >
> >> > In the latter case, algorithms set bits and track the number of nulls
> >> > while constructing the output Arrow array; the former adds some extra
> >> > complexity.
> >> >
> >> > The question of course, is where to enforce "required" in data
> >> > interchange. If two systems have agreed (through exchange of
> >> > schemas/metadata) that a particular batch of Arrow data is
> >> > non-nullable, I would suggest that the null_count == 0 contract be
> >> > validated at that point.
> >> >
> >> > Curious to hear others' thoughts on this, and please let me know if I
> >> > can clarify anything I've said here.
> >> >
> >> > best,
> >> > Wes
> >> >
> >>
>