Have you considered representing nulls using a placeholder value? For values of 4 or more bits, there is very likely at least one value that doesn't occur in a given batch. That value could be used as the null placeholder. Then you don't need a null vector.
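A minimal sketch of this sentinel-search idea (the helper name is hypothetical, and a real implementation for wide types would avoid the full occurrence table):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: find a 16-bit value that does not occur in the
// batch, to serve as the null placeholder. Returns nullopt only when all
// 65536 values are present (the "value space is filled up" case, where
// you would move to the next size up).
std::optional<uint16_t> find_null_sentinel(const std::vector<uint16_t>& batch) {
  std::vector<bool> seen(1u << 16, false);
  for (uint16_t v : batch) seen[v] = true;
  for (uint32_t candidate = 0; candidate < (1u << 16); ++candidate) {
    if (!seen[candidate]) return static_cast<uint16_t>(candidate);
  }
  return std::nullopt;  // would have to widen the type instead
}
```

With the sentinel in hand, a batch can store nulls in-band and comparisons stay single-branch, at the cost of a per-batch scan.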
For small values such as byte and short, it's possible that the value space is filled up, in which case you could move to the next size up. Representing nulls this way would save allocating an extra piece of memory and using up another cache line, and I imagine it could quite often reduce the number of branch instructions: you could check 'v[i] < 10' rather than 'v[i] < 10 && !isnull[i]'. Or maybe this kind of null compression (and other kinds of compression, like dictionary encoding and using non-multiples of 8 bits) belongs at the application layer?

Julian

> On Feb 20, 2016, at 12:41 PM, Wes McKinney <w...@cloudera.com> wrote:
>
> Since the data is immutable, not having to ever recompute the null
> count has a lot of benefits -- I find that sum(isnull(x)) is a
> commonly examined statistic. I'm expecting we'll soon have
> hardware-accelerated popcount available in the implementations, so
> that even the null-aware code paths can cheaply skip swaths of all-null
> data.
>
> On Sat, Feb 20, 2016 at 12:31 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>> I think part of it comes from the headache of all the types, etc. You get
>> into some sort of permutation fatigue :)
>>
>> bool v. integer:
>>
>> Interesting question. Haven't thought a lot about it, but it seems like
>> the cardinality of nulls could be a useful metric for choosing algorithms.
>> Not sure that it is more expensive to maintain.
>>
>> On Sat, Feb 20, 2016 at 12:12 PM, Daniel Robinson <danrobinson...@gmail.com> wrote:
>>
>>> I yield to your judgment from experience, although I am surprised it
>>> wouldn't simplify the code in this case, since you will use the same
>>> algorithms on (and, if you never allocate a bitmask as Wes suggested, use
>>> the same data structures for) "nullable" arrays with null_count 0 as for
>>> non-nullable arrays.
>>> And formally I think it makes sense to have a nullable Array<T> have
>>> one of two types: some_nulls Array<T> or no_nulls Array<T>, much like
>>> its values have either type T or null.
>>>
>>> Is there a reason to make null_count an integer? Or could it just be a
>>> bool that keeps track of whether there are nulls or not?
>>>
>>> On Sat, Feb 20, 2016 at 3:03 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>>>
>>>> Makes sense
>>>>
>>>> On Sat, Feb 20, 2016 at 11:56 AM, Wes McKinney <w...@cloudera.com> wrote:
>>>>
>>>>> My expectation would be that data without nulls (as with required
>>>>> types) would typically not have the null bitmap allocated at all, but
>>>>> this would be implementation dependent. For example, in builder
>>>>> classes, the first time a null is appended, the null bitmap could be
>>>>> allocated.
>>>>>
>>>>> In an IPC / wire protocol context, there would be no reason to send
>>>>> extra bits when the null count is 0 -- the data receiver, based on
>>>>> their implementation, could decide whether or not to allocate a bitmap
>>>>> based on that information. Since the data structures are intended as
>>>>> immutable, there is no specific need to create an all-0 bitmap.
>>>>>
>>>>> On Sat, Feb 20, 2016 at 11:52 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>>>>>> We actually started there (and in fact Drill existed there for the
>>>>>> last three years). However, more and more, other members of that team
>>>>>> and I have come to the conclusion that the additional complexity isn't
>>>>>> worth the extra level of code complication. By providing the null
>>>>>> count we can achieve the same level of efficiency (+/- carrying around
>>>>>> an extra bitmap, which is pretty nominal in the grand scheme of
>>>>>> things).
>>>>>>
>>>>>> Another thought could be exposing nullability as a physical property
>>>>>> and not have it be part of the logical model. That being said, I don't
>>>>>> think it is worth the headache.
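The lazy bitmap allocation Wes describes above (allocate only when the first null is appended) could look roughly like this. This is an illustrative sketch, not the actual Arrow builder API, and it uses std::vector<bool> as a stand-in for a real packed validity bitmap:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical builder sketch: the validity bitmap is only materialized
// the first time a null is appended, so all-valid arrays never pay for it.
class Int32Builder {
 public:
  void Append(int32_t value) {
    values_.push_back(value);
    // Only track validity once a bitmap exists.
    if (!null_bitmap_.empty()) null_bitmap_.push_back(true);
  }

  void AppendNull() {
    if (null_bitmap_.empty()) {
      // First null: backfill validity for everything appended so far.
      null_bitmap_.assign(values_.size(), true);
    }
    values_.push_back(0);  // placeholder slot under the null
    null_bitmap_.push_back(false);
    ++null_count_;
  }

  std::size_t null_count() const { return null_count_; }
  bool has_bitmap() const { return !null_bitmap_.empty(); }

 private:
  std::vector<int32_t> values_;
  std::vector<bool> null_bitmap_;  // stand-in for a packed bitmap buffer
  std::size_t null_count_ = 0;
};
```

A builder used on all-valid data never allocates the bitmap, matching the "no bitmap when null_count == 0" behavior discussed in the thread.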
>>>>>>
>>>>>> On Sat, Feb 20, 2016 at 11:43 AM, Daniel Robinson <danrobinson...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I like this proposal (as well as the rest of the spec so far!). But
>>>>>>> why not go further and just store arrays that are nullable according
>>>>>>> to the schema but have no nulls in them as "non-nullable" data
>>>>>>> structures—i.e. structures that have no null bitmask? (After all, it
>>>>>>> would obviously be a waste to allocate a null bitmask for arrays with
>>>>>>> null_count = 0.) So there will be two types on the data structure
>>>>>>> level, and two implementations of every algorithm, one for each of
>>>>>>> those types.
>>>>>>>
>>>>>>> If you do that, I'm not sure I see a reason for keeping track of
>>>>>>> null_count. Is there ever an efficiency gain from having that stored
>>>>>>> with an array? Algorithms that might introduce or remove nulls could
>>>>>>> just keep track of their own "null_count" that increments up from 0,
>>>>>>> and create a no-nulls data structure if they never find one.
>>>>>>>
>>>>>>> I think this might also simplify the system interchange validation
>>>>>>> problem, since a system could just check the data-structure-level
>>>>>>> type of the input. (Although I'm not sure I understand why that would
>>>>>>> be necessary at "runtime.")
>>>>>>>
>>>>>>> Perhaps you should have different names for the data-structure-level
>>>>>>> types to distinguish them from the "nullable" and "non-nullable"
>>>>>>> types at the schema level. (And also for philosophical reasons—since
>>>>>>> the arrays are immutable, "nullable" doesn't really have meaning on
>>>>>>> that level, does it?) "some_null" and "no_null"? Maybe "sparse" and
>>>>>>> "dense," although that too has a different meaning elsewhere in the
>>>>>>> spec...
>>>>>>>
>>>>>>> On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <w...@cloudera.com> wrote:
>>>>>>>
>>>>>>>> hi folks,
>>>>>>>>
>>>>>>>> welcome to all! It's great to see so many people excited about our
>>>>>>>> plans to make data systems faster and more interoperable.
>>>>>>>>
>>>>>>>> In thinking about building some initial Arrow integrations, I've run
>>>>>>>> into a couple of inter-related format questions.
>>>>>>>>
>>>>>>>> The first is a proposal to add a null count to Arrow arrays. With
>>>>>>>> optional/nullable data, null_count == 0 will allow algorithms to
>>>>>>>> skip the null-handling code paths and treat the data as
>>>>>>>> required/non-nullable, yielding performance benefits. For example:
>>>>>>>>
>>>>>>>> if (arr->null_count() == 0) {
>>>>>>>>   ...
>>>>>>>> } else {
>>>>>>>>   ...
>>>>>>>> }
>>>>>>>>
>>>>>>>> Relatedly, at the data structure level, there is little semantic
>>>>>>>> distinction between these two cases:
>>>>>>>>
>>>>>>>> - Required / non-nullable arrays
>>>>>>>> - Optional arrays with null count 0
>>>>>>>>
>>>>>>>> My thoughts are that "required-ness" would best be handled at the
>>>>>>>> metadata / schema level, rather than tasking the lowest tier of data
>>>>>>>> structures and algorithms with handling the two semantically
>>>>>>>> distinct, but functionally equivalent, forms of data without nulls.
>>>>>>>> When performing analytics, it adds complexity, as some operations
>>>>>>>> may introduce or remove nulls, which would require type metadata to
>>>>>>>> be massaged, i.e.:
>>>>>>>>
>>>>>>>> function(required input) -> optional output, versus
>>>>>>>>
>>>>>>>> function(input [null_count == 0]) -> output [maybe null_count > 0].
>>>>>>>>
>>>>>>>> In the latter case, algorithms set bits and track the number of
>>>>>>>> nulls while constructing the output Arrow array; the former adds
>>>>>>>> some extra complexity.
>>>>>>>>
>>>>>>>> The question, of course, is where to enforce "required" in data
>>>>>>>> interchange. If two systems have agreed (through exchange of
>>>>>>>> schemas/metadata) that a particular batch of Arrow data is
>>>>>>>> non-nullable, I would suggest that the null_count == 0 contract be
>>>>>>>> validated at that point.
>>>>>>>>
>>>>>>>> Curious to hear others' thoughts on this, and please let me know if
>>>>>>>> I can clarify anything I've said here.
>>>>>>>>
>>>>>>>> best,
>>>>>>>> Wes
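The null_count == 0 fast path quoted above can be sketched concretely. Int64Array and Sum here are illustrative stand-ins (not Arrow's actual types), with std::vector<bool> again standing in for a packed validity bitmap:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical array view: values plus an optional validity bitmap and a
// precomputed null count, as proposed in the thread.
struct Int64Array {
  std::vector<int64_t> values;
  std::vector<bool> valid;  // empty when null_count == 0
  std::size_t null_count;
};

// Sum with the null_count == 0 fast path: when there are no nulls, the
// per-element validity check (and the bitmap itself) is skipped entirely.
int64_t Sum(const Int64Array& arr) {
  int64_t total = 0;
  if (arr.null_count == 0) {
    for (int64_t v : arr.values) total += v;  // branch-free inner loop
  } else {
    for (std::size_t i = 0; i < arr.values.size(); ++i) {
      if (arr.valid[i]) total += arr.values[i];
    }
  }
  return total;
}
```

Because null_count is stored with the immutable array, the dispatch costs one comparison per batch rather than one per element, which is the efficiency argument made in the thread.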