hi folks,

welcome to all! It's great to see so many people excited about our
plans to make data systems faster and more interoperable.

In thinking about building some initial Arrow integrations, I've run
into a couple of interrelated format questions.

The first is a proposal to add a null count to Arrow arrays. With
optional/nullable data, null_count == 0 will allow algorithms to skip
the null-handling code paths and treat the data as
required/non-nullable, yielding performance benefits. For example:

if (arr->null_count() == 0) {
  // fast path: no nulls, so the validity bitmap can be ignored
  ...
} else {
  // general path: consult the validity bitmap for each value
  ...
}
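
For illustration, a null count can always be derived from a validity
bitmap when it isn't tracked directly. A minimal sketch (the helper
name and LSB bit-ordering convention are mine, nothing settled):

#include <cstdint>

// Hypothetical helper: derive the null count by popcounting the
// validity bitmap (LSB bit ordering; a 0 bit marks a null slot).
int64_t CountNulls(const uint8_t* valid_bits, int64_t length) {
  if (valid_bits == nullptr) return 0;  // no bitmap => no nulls
  int64_t valid = 0;
  for (int64_t i = 0; i < length; ++i) {
    valid += (valid_bits[i / 8] >> (i % 8)) & 1;
  }
  return length - valid;
}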

Relatedly, at the data structure level, there is little semantic
distinction between these two cases (a sketch follows the list):

- Required / Non-nullable arrays
- Optional arrays with null count 0
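
To make that concrete, here is a hypothetical physical layout (for
discussion only, not a proposal for the actual struct) in which the
two cases are indistinguishable:

#include <cstdint>

// Hypothetical layout: identical contents for a required array and an
// optional array with null_count == 0.
struct Int32ArrayData {
  int64_t length;
  int64_t null_count;         // 0 in both cases
  const uint8_t* valid_bits;  // all ones (or omitted) when null_count == 0
  const int32_t* values;
};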

My thought is that "required-ness" is best handled at the
metadata / schema level, rather than tasking the lowest tier of data
structures and algorithms with handling two semantically distinct,
but functionally equivalent, forms of data without nulls. When
performing analytics, the distinction adds complexity: some
operations may introduce or remove nulls, which would require type
metadata to be massaged, i.e.:

function(required input) -> optional output, versus

function(input [null_count == 0]) -> output [maybe null_count > 0].

In the latter case, algorithms simply set validity bits and track the
number of nulls while constructing the output Arrow array; the former
adds the extra step of massaging type metadata between required and
optional.
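
As a sketch of the latter case (the name and signature are
hypothetical, not a proposed API), an operation that may introduce
nulls just sets bits and counts as it goes:

#include <cstdint>

// Hypothetical kernel: integer division that yields null on division
// by zero. out_valid_bits must hold at least (length + 7) / 8 bytes.
void SafeDivide(const int32_t* left, const int32_t* right, int64_t length,
                int32_t* out_values, uint8_t* out_valid_bits,
                int64_t* out_null_count) {
  int64_t null_count = 0;
  for (int64_t i = 0; i < length; ++i) {
    if (right[i] == 0) {
      out_valid_bits[i / 8] &= ~(1u << (i % 8));  // clear bit: null slot
      ++null_count;
    } else {
      out_valid_bits[i / 8] |= (1u << (i % 8));   // set bit: valid slot
      out_values[i] = left[i] / right[i];
    }
  }
  *out_null_count = null_count;
}

The output's nullability then falls out of the data itself; no
input-side "required" handling is needed in the kernel.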

The question, of course, is where to enforce "required" in data
interchange. If two systems have agreed (through exchange of
schemas/metadata) that a particular batch of Arrow data is
non-nullable, I would suggest that the null_count == 0 contract be
validated at that point.
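
Concretely, that check could be as simple as the following sketch
(the template stands in for whatever array type a receiving system
uses; nothing here is a settled API):

// Hypothetical boundary check: a batch declared non-nullable must
// arrive with null_count == 0.
template <typename ArrayType>
bool ValidateRequiredContract(const ArrayType& arr) {
  return arr.null_count() == 0;
}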

Curious to hear others' thoughts on this, and please let me know if I
can clarify anything I've said here.

best,
Wes
