hi folks, welcome to all! It's great to see so many people excited about our plans to make data systems faster and more interoperable.
In thinking about building some initial Arrow integrations, I've run into a couple of inter-related format questions.

The first is a proposal to add a null count to Arrow arrays. With optional/nullable data, null_count == 0 allows algorithms to skip the null-handling code paths and treat the data as required/non-nullable, yielding performance benefits. For example:

    if (arr->null_count() == 0) {
      ...  // fast path: no null handling needed
    } else {
      ...  // general path: consult the validity bitmap
    }

Relatedly, at the data structure level there is little semantic distinction between these two cases:

- Required / non-nullable arrays
- Optional arrays with null count 0

My view is that "required-ness" is best tracked at the metadata/schema level, rather than tasking the lowest tier of data structures and algorithms with handling two semantically distinct, but functionally equivalent, forms of data without nulls. Keeping the distinction in the arrays also adds complexity when performing analytics, since some operations may introduce or remove nulls, which would require the type metadata to be massaged, i.e.

    function(required input) -> optional output

versus

    function(input [null_count == 0]) -> output [maybe null_count > 0]

In the latter case, algorithms simply set validity bits and track the number of nulls while constructing the output Arrow array; the former adds some extra complexity.

The question, of course, is where to enforce "required" in data interchange. If two systems have agreed (through an exchange of schemas/metadata) that a particular batch of Arrow data is non-nullable, I would suggest that the null_count == 0 contract be validated at that point.

Curious to hear others' thoughts on this, and please let me know if I can clarify anything I've said here.

best,
Wes
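P.S. To make the fast-path and validation points concrete, here is a rough sketch in C++. The types and names below (Int32Array, ValidateNonNullable, and so on) are stand-ins I made up for illustration; they are not the actual Arrow implementation or its API:

    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical minimal array type, for illustration only -- not
    // the real Arrow C++ classes.
    struct Int32Array {
      std::vector<int32_t> values;
      std::vector<bool> valid;   // per-slot validity flags
      int64_t null_count = 0;

      int64_t length() const { return static_cast<int64_t>(values.size()); }
      bool IsNull(int64_t i) const { return !valid[i]; }
      int32_t Value(int64_t i) const { return values[i]; }
    };

    // A kernel can branch once on null_count and skip null handling
    // entirely when the data is effectively "required".
    int64_t Sum(const Int32Array& arr) {
      int64_t sum = 0;
      if (arr.null_count == 0) {
        // Fast path: never touch the validity flags
        for (int64_t i = 0; i < arr.length(); ++i) {
          sum += arr.Value(i);
        }
      } else {
        // General path: check each slot's validity
        for (int64_t i = 0; i < arr.length(); ++i) {
          if (!arr.IsNull(i)) sum += arr.Value(i);
        }
      }
      return sum;
    }

    // At an interchange boundary, if the agreed-upon schema declares a
    // field non-nullable, this is where the null_count == 0 contract
    // would be checked.
    bool ValidateNonNullable(const Int32Array& arr, bool schema_says_required,
                             std::string* error) {
      if (schema_says_required && arr.null_count != 0) {
        *error = "schema declares field required, but batch contains nulls";
        return false;
      }
      return true;
    }

The point being: the branch on null_count is cheap and made once per array, while the required-vs-optional distinction lives only in the schema and is enforced at the interchange boundary.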