Some inline thoughts

On Sat, Feb 20, 2016 at 5:16 PM, Julian Hyde <jh...@apache.org> wrote:
> Have you considered representing nulls using a placeholder value? For values 
> with 4 or more bits, there is at least one value that doesn’t occur in a 
> given batch. That value could be used as the null placeholder. Then you don’t 
> need a null vector.
>

Using any kind of null sentinel / placeholder is tricky. A constant
sentinel is not desirable because it limits your ability to represent
the full range of values of the primitive C types (also, what sentinel
do you use for variable-length / list types?). Having a dynamic
sentinel is interesting, but consider:

- How do you know what value to use in incremental builder classes?
- If two arrays of the same type use different sentinels,
concatenation may require non-trivial computation beyond a memcpy
- Bitmasks allow you to use hardware popcount to determine whether a
range of values contains any nulls at all, which is much faster than
value-by-value comparisons (see the sketch below)

I haven't done a microbenchmark to understand the performance
implications of primitive value comparison versus bit checking; that
would be useful information to have.
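
To make the bitmask point concrete, here is a minimal sketch (the
function name and signature are mine, nothing from Arrow; it assumes
GCC/Clang's __builtin_popcountll, a validity bitmap where a set bit
means non-null, and a bitmap padded out to a whole 64-bit word):

#include <cstdint>

// True if any of the first `length` values is null.
bool has_nulls(const uint64_t* validity, int64_t length) {
  int64_t full_words = length / 64;
  for (int64_t i = 0; i < full_words; ++i) {
    // a single POPCNT instruction on modern hardware
    if (__builtin_popcountll(validity[i]) != 64) return true;
  }
  int64_t tail = length % 64;
  if (tail > 0) {
    uint64_t mask = (uint64_t(1) << tail) - 1;
    if (__builtin_popcountll(validity[full_words] & mask) != tail)
      return true;
  }
  return false;
}

The sentinel alternative has to compare every value; there is no way
to rule out nulls 64 slots at a time.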

> For small values such as byte and short, it’s possible that the value space 
> is filled up, in which case you could move to the next size up.
>

In general, many applications may not find this palatable, as data
already stored in a particular integer type (e.g. int8) would have to
be upcast if it is deemed not to "fit" (and determining this could
itself be computationally intensive).
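
To illustrate the parenthetical: even deciding whether an int8 batch
has a free value to use as a sentinel requires a full pass over the
data (a hypothetical helper, not part of any implementation):

#include <cstddef>
#include <cstdint>

// Writes an unused value to *out and returns true; returns false if
// all 256 int8 values occur, forcing an upcast (e.g. to int16).
bool find_unused_sentinel(const int8_t* values, size_t n, int8_t* out) {
  bool seen[256] = {false};
  for (size_t i = 0; i < n; ++i) {  // unavoidable full scan
    seen[static_cast<uint8_t>(values[i])] = true;
  }
  for (int v = 0; v < 256; ++v) {
    if (!seen[v]) {
      *out = static_cast<int8_t>(v);
      return true;
    }
  }
  return false;  // value space saturated
}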

> Representing nulls this way would save allocating an extra piece of memory, 
> using up another cache line, and, I imagine, could quite often reduce the 
> number of branch instructions used: you could check ‘v[i] < 10’ rather than 
> ‘v[i] < 10 and !isnull[i]’.
>
> Or maybe this kind of null compression (and other kinds of compression like 
> dictionary encoding and using non-multiples of 8 bits) belong at the 
> application layer?
>

Perhaps so -- several people have asked me about dictionary encoding
in the context of Arrow. In that particular case, a dictionary-encoded
array is really two arrays -- an integer index array plus the
dictionary array -- interpreted as an array of the logical type of the
dictionary. That seems to me like a problem of the metadata or
application domain. Of course, we can provide fast algorithms within
Arrow for dictionary encoding and other such standard operations.
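
Concretely, I mean something like the following (a rough sketch with
made-up names, not a committed layout):

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// A dictionary-encoded string array: two physical arrays interpreted
// together as one logical array of strings.
struct DictionaryEncodedStrings {
  std::vector<int32_t> indices;         // one code per logical element
  std::vector<std::string> dictionary;  // the distinct values

  // Logical element i is dictionary[indices[i]].
  const std::string& value(size_t i) const { return dictionary[indices[i]]; }
};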

> Julian
>
>
>> On Feb 20, 2016, at 12:41 PM, Wes McKinney <w...@cloudera.com> wrote:
>>
>> Since the data is immutable, not having to ever recompute the null
>> count has a lot of benefits -- I find that sum(isnull(x)) is a
>> commonly examined statistic. I'm expecting we'll soon have
>> hardware-accelerated popcount available in the implementations, so
>> that even the null-aware code paths can cheaply skip swaths of all
>> null data.
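
To make the "skip swaths" idea concrete, a rough sketch (made-up
names; assumes 1 = non-null in the validity bitmap, LSB-first bit
order, and a length that is a multiple of 64):

#include <cstdint>

// Sums the non-null values, consulting the validity bitmap one
// 64-bit word at a time so that all-null stretches are skipped with
// a single compare.
int64_t sum_non_null(const uint64_t* validity, const int32_t* values,
                     int64_t num_words) {
  int64_t sum = 0;
  for (int64_t w = 0; w < num_words; ++w) {
    if (validity[w] == 0) continue;  // 64 consecutive nulls: skip entirely
    for (int i = 0; i < 64; ++i) {
      if ((validity[w] >> i) & 1) sum += values[w * 64 + i];
    }
  }
  return sum;
}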
>>
>> On Sat, Feb 20, 2016 at 12:31 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>>> I think part of it comes from the headache of all the types, etc. You get
>>> into some sort of permutation fatigue :)
>>>
>>> bool v. integer:
>>>
>>> Interesting question. Haven't thought a lot about it, but it seems like
>>> the cardinality of nulls could be a useful metric for choosing
>>> algorithms. Not sure that it is more expensive to maintain.
>>>
>>> On Sat, Feb 20, 2016 at 12:12 PM, Daniel Robinson <danrobinson...@gmail.com>
>>> wrote:
>>>
>>>> I yield to your judgment from experience. Although I am surprised it
>>>> wouldn't simplify the code in this case, since you will use the same
>>>> algorithms on (and, if you never allocate a bitmask as Wes suggested, use
>>>> the same data structures for) "nullable arrays" with null_count 0 as for
>>>> non-nullable arrays. And formally I think it makes sense to have a nullable
>>>> Array<T> have one of two types: some_nulls Array<T> or no_nulls Array<T>,
>>>> much like its values have either type T or null.
>>>>
>>>> Is there a reason to make null_count an integer? Or could it just be a bool
>>>> that keeps track of whether there are nulls or not?
>>>>
>>>>
>>>> On Sat, Feb 20, 2016 at 3:03 PM, Jacques Nadeau <jacq...@apache.org>
>>>> wrote:
>>>>
>>>>> Makes sense
>>>>>
>>>>> On Sat, Feb 20, 2016 at 11:56 AM, Wes McKinney <w...@cloudera.com> wrote:
>>>>>
>>>>>> My expectation would be that data without nulls (as with required
>>>>>> types) would typically not have the null bitmap allocated at all, but
>>>>>> this would be implementation dependent. For example, in builder
>>>>>> classes, the first time a null is appended, the null bitmap could be
>>>>>> allocated.
>>>>>>
>>>>>> In an IPC / wire protocol context, there would be no reason to send
>>>>>> extra bits when the null count is 0 -- the data receiver, based on
>>>>>> their implementation, could decide whether or not to allocate a bitmap
>>>>>> based on that information. Since the data structures are intended as
>>>>>> immutable, there is no specific need to create an all-0 bitmap.
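
A builder along those lines might look roughly like this -- a sketch
with made-up names (std::vector<bool> stands in for a real bitmap):

#include <cstdint>
#include <vector>

// Defers allocating the validity bitmap until the first null is
// appended; arrays that never see a null never pay for one.
class Int32Builder {
 public:
  void Append(int32_t v) {
    values_.push_back(v);
    if (!validity_.empty()) validity_.push_back(true);
  }

  void AppendNull() {
    if (validity_.empty()) {
      // First null: materialize the bitmap, marking all prior values valid.
      validity_.assign(values_.size(), true);
    }
    values_.push_back(0);  // placeholder slot; the value is undefined
    validity_.push_back(false);
    ++null_count_;
  }

  int64_t null_count() const { return null_count_; }

 private:
  std::vector<int32_t> values_;
  std::vector<bool> validity_;  // stays empty while null_count_ == 0
  int64_t null_count_ = 0;
};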
>>>>>>
>>>>>> On Sat, Feb 20, 2016 at 11:52 AM, Jacques Nadeau <jacq...@apache.org>
>>>>>> wrote:
>>>>>>> We actually started there (and in fact Drill existed there for the
>>>>>>> last three years). However, more and more, other members of that
>>>>>>> team and I have come to the conclusion that the additional
>>>>>>> complexity isn't worth the extra level of code complication. By
>>>>>>> providing the null count we can achieve the same level of efficiency
>>>>>>> (+/- carrying around an extra bitmap, which is pretty nominal in the
>>>>>>> grand scheme of things).
>>>>>>>
>>>>>>> Another thought could be exposing nullability as a physical property
>>>>>>> and not having it be part of the logical model. That being said, I
>>>>>>> don't think it is worth the headache.
>>>>>>>
>>>>>>> On Sat, Feb 20, 2016 at 11:43 AM, Daniel Robinson <danrobinson...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I like this proposal (as well as the rest of the spec so far!).
>>>>>>>> But why not go further and just store arrays that are nullable
>>>>>>>> according to the schema but have no nulls in them as "non-nullable"
>>>>>>>> data structures—i.e. structures that have no null bitmask? (After
>>>>>>>> all, it would obviously be a waste to allocate a null bitmask for
>>>>>>>> arrays with null_count = 0.) So there will be two types on the data
>>>>>>>> structure level, and two implementations of every algorithm, one
>>>>>>>> for each of those types.
>>>>>>>>
>>>>>>>> If you do that, I'm not sure I see a reason for keeping track of
>>>>>>>> null_count. Is there ever an efficiency gain from having that
>>>>>>>> stored with an array? Algorithms that might introduce or remove
>>>>>>>> nulls could just keep track of their own "null_count" that
>>>>>>>> increments up from 0, and create a no-nulls data structure if they
>>>>>>>> never find one.
>>>>>>>>
>>>>>>>> I think this might also simplify the system interchange validation
>>>>>>>> problem, since a system could just check the data-structure-level
>>>>>>>> type of the input. (Although I'm not sure I understand why that
>>>>>>>> would be necessary at "runtime.")
>>>>>>>>
>>>>>>>> Perhaps you should have different names for the
>>>>>>>> data-structure-level types to distinguish them from the "nullable"
>>>>>>>> and "non-nullable" types at the schema level. (And also for
>>>>>>>> philosophical reasons—since the arrays are immutable, "nullable"
>>>>>>>> doesn't really have meaning on that level, does it?) "some_null"
>>>>>>>> and "no_null"? Maybe "sparse" and "dense," although that too has a
>>>>>>>> different meaning elsewhere in the spec...
>>>>>>>>
>>>>>>>> On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <w...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> hi folks,
>>>>>>>>>
>>>>>>>>> welcome to all! It's great to see so many people excited about our
>>>>>>>>> plans to make data systems faster and more interoperable.
>>>>>>>>>
>>>>>>>>> In thinking about building some initial Arrow integrations, I've
>>>>>>>>> run into a couple of inter-related format questions.
>>>>>>>>>
>>>>>>>>> The first is a proposal to add a null count to Arrow arrays. With
>>>>>>>>> optional/nullable data, null_count == 0 will allow algorithms to
>>>>>>>>> skip the null-handling code paths and treat the data as
>>>>>>>>> required/non-nullable, yielding performance benefits. For example:
>>>>>>>>>
>>>>>>>>> if (arr->null_count() == 0) {
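>>>>>>>>>  // fast path: no nulls anywhere, skip validity checks entirely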
>>>>>>>>>  ...
>>>>>>>>> } else {
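>>>>>>>>>  // slow path: consult the validity bitmap value by value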
>>>>>>>>>  ...
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Relatedly, at the data structure level, there is little semantic
>>>>>>>>> distinction between these two cases:
>>>>>>>>>
>>>>>>>>> - Required / Non-nullable arrays
>>>>>>>>> - Optional arrays with null count 0
>>>>>>>>>
>>>>>>>>> My thoughts are that "required-ness" would best be handled at the
>>>>>>>>> metadata / schema level, rather than tasking the lowest tier of
>>>>>>>>> data structures and algorithms with handling the two semantically
>>>>>>>>> distinct, but functionally equivalent forms of data without nulls.
>>>>>>>>> When performing analytics, it adds complexity, as some operations
>>>>>>>>> may introduce or remove nulls, which would require type metadata
>>>>>>>>> to be massaged, i.e.:
>>>>>>>>>
>>>>>>>>> function(required input) -> optional output, versus
>>>>>>>>>
>>>>>>>>> function(input [null_count == 0]) -> output [maybe null_count > 0].
>>>>>>>>>
>>>>>>>>> In the latter case, algorithms set bits and track the number of
>>>>>>>>> nulls while constructing the output Arrow array; the former adds
>>>>>>>>> some extra complexity.
>>>>>>>>>
>>>>>>>>> The question of course, is where to enforce "required" in data
>>>>>>>>> interchange. If two systems have agreed (through exchange of
>>>>>>>>> schemas/metadata) that a particular batch of Arrow data is
>>>>>>>>> non-nullable, I would suggest that the null_count == 0 contract be
>>>>>>>>> validated at that point.
>>>>>>>>>
>>>>>>>>> Curious to hear others' thoughts on this, and please let me know
>>>>>>>>> if I can clarify anything I've said here.
>>>>>>>>>
>>>>>>>>> best,
>>>>>>>>> Wes
>>>>>>>>>
