Hello all,

I think that guarantees on masked values are worthwhile to define for
more than a single type in isolation. In particular, requiring this
exclusively for Utf8View will leave Utf8 and LargeUtf8 as arrays which
*may* legally have non-utf8 masked values but cannot be consumed by
arrow-rs. Furthermore, this has been a pain point for the arrow format
in the past since consumers would frequently benefit from such
guarantees but have no way to negotiate them with producers. For
example, R uses a specific NaN [1] to represent missing floating point
values, and before 2.0 pandas did something similar.
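
To make the R example concrete: as far as I can tell from [1], R's NA
is a NaN whose low 32-bit word is 1954, so only the payload
distinguishes it from an ordinary NaN. A minimal Python sketch of that
check (the helper names are mine, not R's or arrow's):

```python
import math
import struct

# Bit pattern built by R_ValueOfNA in [1]: high word 0x7FF00000 (NaN
# exponent), low word 1954. The names below are illustrative only.
R_NA_BITS = (0x7FF00000 << 32) | 1954
ORDINARY_NAN_BITS = 0x7FF8000000000000  # a typical quiet NaN


def bits_to_float(bits: int) -> float:
    """Reinterpret a 64-bit pattern as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def is_r_na(bits: int) -> bool:
    """True if the pattern is a NaN whose low 32-bit word is 1954."""
    return math.isnan(bits_to_float(bits)) and (bits & 0xFFFFFFFF) == 1954


# To IEEE 754 both patterns are just NaN; only the payload check
# tells them apart.
print(math.isnan(bits_to_float(R_NA_BITS)))
print(is_r_na(R_NA_BITS), is_r_na(ORDINARY_NAN_BITS))
```

A consumer that does not know about this convention sees an ordinary
NaN, which is exactly the kind of unnegotiated guarantee I mean.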

I think the simplest way to express this in an arrow schema would be a
new canonical metadata field, `ARROW:masked_value_guarantee` or so: if
the KeyValue is present and the value is "safe", then masked values are
safe to access as if they were not masked and satisfy all type
constraints that an unmasked value would. This field can be set on a
Schema (or maybe on a Field?) by a producer to certify that the safety
guarantee holds for all data in a file/stream, so that optimizations
such as arrow-rs' whole-buffer utf-8 validation may be performed safely.
I'd tentatively suggest that another useful value to associate with
this metadata key would be "zero", i.e. masked integers and floats are
all zero, strings under null bits are empty, etc.
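
As a rough illustration of how a consumer might act on this, here is a
Python sketch using plain dicts, lists and bytes rather than any real
arrow library. The key and its two values are just the ones floated
above, every function name is hypothetical, and I'm assuming "zero"
implies "safe":

```python
from typing import Mapping, Optional, Sequence

# Hypothetical key and values from the proposal above; none of this is
# part of the Arrow spec or of any Arrow library today.
GUARANTEE_KEY = b"ARROW:masked_value_guarantee"
KNOWN_GUARANTEES = {b"safe", b"zero"}


def masked_value_guarantee(
    metadata: Optional[Mapping[bytes, bytes]]
) -> Optional[bytes]:
    """Return the producer's guarantee, or None if absent/unrecognized."""
    if not metadata:
        return None
    value = metadata.get(GUARANTEE_KEY)
    return value if value in KNOWN_GUARANTEES else None


def _decodes(chunk: bytes) -> bool:
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def utf8_is_valid(
    data: bytes,
    offsets: Sequence[int],
    validity: Sequence[bool],
    guarantee: Optional[bytes],
) -> bool:
    """Validate a Utf8-style array (data buffer, offsets, validity)."""
    if guarantee is not None:
        # Masked slots are certified valid: validate the referenced
        # range in one pass, then check that every offset lands on a
        # character boundary (not on a 0b10xxxxxx continuation byte).
        start, end = offsets[0], offsets[-1]
        if not _decodes(data[start:end]):
            return False
        return all(o == end or (data[o] & 0xC0) != 0x80 for o in offsets)
    # No guarantee: masked slots may hold anything, so only the
    # unmasked slots can be checked, one at a time.
    return all(
        _decodes(data[offsets[i]:offsets[i + 1]])
        for i, is_valid in enumerate(validity)
        if is_valid
    )


# Three slots: "ab", a masked slot holding non-utf8 bytes, "cd".
data = b"ab\xff\xfecd"
offsets = [0, 2, 4, 6]
validity = [True, False, True]

guarantee = masked_value_guarantee({GUARANTEE_KEY: b"safe"})
print(utf8_is_valid(data, offsets, validity, None))       # per-slot path
print(utf8_is_valid(data, offsets, validity, guarantee))  # whole-buffer path
```

The array above is legal without the guarantee but fails the
whole-buffer check, which is the Utf8/LargeUtf8 situation described
earlier.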

Relatedly, in the C++ implementation prior to ARROW-2790 [2], masked
values in new arrays were not initialized. That is *technically* fine
if we never access those values, but if they are zeroed then arithmetic
can be performed independent of the null bitmap without causing
sanitizers like valgrind to report access of uninitialized values. I
think the addition of this guarantee on masked values would be a useful
formalization of performance/correctness decisions made in different
ways by multiple implementations.
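
A toy Python sketch of the bitmap-independent arithmetic I have in mind
(pure Python standing in for a real kernel; the function names and the
list-of-bools validity representation are made up for illustration):

```python
from array import array


def new_int64_array(values):
    """Build (data, validity); None marks a null, stored as 0."""
    data = array("q", (0 if v is None else v for v in values))
    validity = [v is not None for v in values]
    return data, validity


def add(lhs, rhs):
    """Elementwise add that never consults validity for the data."""
    (ldata, lvalid), (rdata, rvalid) = lhs, rhs
    # Every slot is read and summed, masked or not. This is only
    # well-defined because masked slots were initialized.
    data = array("q", (a + b for a, b in zip(ldata, rdata)))
    validity = [a and b for a, b in zip(lvalid, rvalid)]
    return data, validity


a = new_int64_array([1, None, 3])
b = new_int64_array([10, 20, None])
out_data, out_validity = add(a, b)

print(list(out_data), out_validity)
# With zeroed masked slots, a plain sum equals the sum of valid values.
print(sum(a[0]))
```

Note that the output here has non-zero values under its null bits, so
it would qualify as "safe" but not as "zero" unless the kernel
re-zeroes masked slots afterwards.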

Supporting this should not require overly invasive changes to the C++
implementation; I *think* it will only require specialization of the
binary->utf8 cast.

Re Utf8View: thanks for your implementation @Raphael! I'll review it
soon. In the interest of completeness, I also plan to write Utf8View as
a canonical extension type as described by @Weston to see how that
looks.

Sincerely,
Ben Kietzman

[1]: https://github.com/wch/r-source/blob/e8b7fcc/src/main/arithmetic.c#L90-L98 (R_ValueOfNA)
[2]: https://issues.apache.org/jira/browse/ARROW-2790
