Returning to this discussion. I did some C++ prototyping:

https://github.com/apache/arrow/pull/9
https://github.com/apache/arrow/pull/10

A handful of thoughts:

1) It is most useful for compatibility with other systems (e.g. Parquet --
see ARROW-22) to have required/optional in the type metadata, but not in
the array data containers (see the rough sketch after this list). For
example, see some Schema unit tests here:
https://github.com/apache/arrow/pull/10/files#diff-3204cc418d726cfc6b7bc68f6ace30f5R18

2) The number of types and code branches to consider is definitely smaller
if you store only the null count in the data structures (in other words,
"let the data be data").

3) The requirement that required types may only be associated with data
having null_count == 0 is a contract that may be best enforced at IO / IPC
boundaries (but this may vary from implementation to implementation). In
the absence of such restrictions, optional/nullable is probably a sensible
default.
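
To make the first three points concrete, here is a rough C++ sketch of the
split I have in mind. The names are illustrative only -- they are not the
actual classes from the PRs above:

    #include <cstdint>
    #include <memory>
    #include <stdexcept>
    #include <string>

    struct DataType {};  // elided for the sketch

    // (1) required/optional ("nullable") lives in the type metadata
    struct Field {
      std::string name;
      std::shared_ptr<DataType> type;
      bool nullable;
    };

    // (2) the data container only tracks a null count (plus validity bitmap)
    struct Array {
      int32_t length;
      int32_t null_count;
      const uint8_t* null_bitmap;  // may be nullptr when null_count == 0
    };

    // (3) enforce the required => null_count == 0 contract at an IO / IPC boundary
    void ValidateOnExport(const Field& field, const Array& array) {
      if (!field.nullable && array.null_count > 0) {
        throw std::invalid_argument("required field contains nulls: " + field.name);
      }
    }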

Scrutiny welcome! This is important in the context of the metadata
definition process.

best,
Wes

On Sat, Feb 20, 2016 at 6:47 PM, Wes McKinney <w...@cloudera.com> wrote:

> Some inline thoughts
>
> On Sat, Feb 20, 2016 at 5:16 PM, Julian Hyde <jh...@apache.org> wrote:
> > Have you considered representing nulls using a placeholder value? For
> > values with 4 or more bits, there is at least one value that doesn’t occur
> > in a given batch. That value could be used as the null placeholder. Then
> > you don’t need a null vector.
> >
>
> Using any kind of null sentinel / placeholder is tricky. Using a
> constant sentinel is not desirable as it limits your ability to
> represent the full range of values of primitive C types (also,
> what sentinel do you use for variable-length / list types?). Having a
> dynamic sentinel is interesting, but consider:
>
> - How do you know what value to use in incremental builder classes?
> - If two arrays of the same type use different sentinels,
> concatenation may require non-trivial computation beyond a memcpy
> - Bitmasks allow you to use hardware popcount to determine if a range
> of values contains any nulls at all, which is much faster than
> value-by-value comparisons (rough sketch below)
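>
> As a rough illustration of the popcount idea (illustrative code, not from
> the prototype; this assumes a bitmap where 1 = valid and a length that is
> a multiple of 64 to keep the sketch short):
>
>     #include <cstddef>
>     #include <cstdint>
>
>     // Returns true if any of the first `length` values is null, i.e. if
>     // any of the corresponding validity bits is 0.
>     bool HasNulls(const uint64_t* bitmap, size_t length) {
>       size_t set_bits = 0;
>       for (size_t i = 0; i < length / 64; ++i) {
>         set_bits += __builtin_popcountll(bitmap[i]);  // GCC/Clang builtin
>       }
>       return set_bits < length;
>     }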
>
> I haven't done a microbenchmark to understand the performance
> implications of primitive value comparison versus bit checking; that
> would be useful information to have.
>
> > For small values such as byte and short, it’s possible that the value
> > space is filled up, in which case you could move to the next size up.
> >
>
> In general, many applications may not find this palatable, as data
> already stored in a particular integer type (e.g. int8) would have to
> be upcast if it were deemed not to "fit" (determining this could itself
> be computationally intensive).
>
> > Representing nulls this way would save allocating an extra piece of
> > memory, using up another cache line, and, I imagine, could quite often
> > reduce the number of branch instructions used: you could check ‘v[i] < 10’
> > rather than ‘v[i] < 10 and !isnull[i]’.
> >
> > Or maybe this kind of null compression (and other kinds of compression,
> > like dictionary encoding and using non-multiples of 8 bits) belongs at the
> > application layer?
> >
>
> Perhaps so -- several people have asked me about dictionary encoding
> in the context of Arrow. In that particular case, a dictionary-encoded
> array is really two arrays -- an integer index array plus the
> dictionary array -- interpreted as an array of the logical type of the
> dictionary. That seems to me like a problem of the metadata or
> application domain. Of course, we can provide fast algorithms within
> Arrow for dictionary encoding and other such standard operations.
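>
> As a rough illustration (made-up names, not an Arrow API), the
> dictionary-encoded array is just the pair below, interpreted through the
> dictionary's logical type:
>
>     #include <memory>
>
>     struct Array;  // some array type, e.g. of the dictionary's logical type
>
>     struct DictionaryEncodedArray {
>       std::shared_ptr<Array> indices;     // integer codes into the dictionary
>       std::shared_ptr<Array> dictionary;  // the distinct values
>     };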
>
> > Julian
> >
>
