Hello all,

I think that guarantees on masked values are worthwhile to define for more than a single type in isolation. In particular, requiring this exclusively for Utf8View would leave Utf8 and LargeUtf8 as types whose arrays *may* legally contain non-UTF-8 masked values, yet such arrays cannot be consumed by arrow-rs. Furthermore, this has been a pain point for the arrow format in the past, since consumers would frequently benefit from such guarantees but have no way to negotiate them with producers. For example, R uses a specific NaN [1] to represent missing floating point values, and before 2.0, pandas did something similar.
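To make the Utf8 concern concrete, here is a minimal Python sketch. Plain `bytes` and lists stand in for a string array's data buffer, offsets and validity bitmap; the function names are made up for illustration and do not correspond to any arrow library. It shows a single pass over the entire data buffer rejecting an array whose only invalid bytes sit under a null, while slot-by-slot validation that honors the bitmap accepts it:

```python
# Hypothetical stand-in for a Utf8 array with three slots: "abc", null, "def".
# The bytes under the masked slot (0xff 0xfe) are not valid UTF-8.
data = b"abc" + b"\xff\xfe" + b"def"
offsets = [0, 3, 5, 8]
validity = [True, False, True]


def whole_buffer_is_utf8(data):
    """Validate every byte in one pass, ignoring the validity bitmap."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def valid_slots_are_utf8(data, offsets, validity):
    """Validate slot by slot, skipping masked slots."""
    for i, is_valid in enumerate(validity):
        if not is_valid:
            continue
        try:
            data[offsets[i]:offsets[i + 1]].decode("utf-8")
        except UnicodeDecodeError:
            return False
    return True


print(whole_buffer_is_utf8(data))                     # False
print(valid_slots_are_utf8(data, offsets, validity))  # True
```

The single-pass check is the cheaper one, which is why a guarantee that masked bytes are also well formed is attractive to consumers.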
I think the simplest way to express this in an arrow schema would be a new canonical metadata field, `ARROW:masked_value_guarantee` or so: if the KeyValue is present and the value is "safe", then masked values are safe to access as if they were not masked and satisfy all type constraints that an unmasked value would. This field can be set on a Schema (or maybe on a Field?) by a producer to certify that the safety guarantee holds for all data in a file/stream, so that optimizations such as arrow-rs' whole-buffer UTF-8 validation may be performed safely.

I'd tentatively suggest that another useful value to associate with this metadata key would be "zero", i.e. masked integers and floats are all zero, strings under null bits are empty, etc. Relatedly, in the C++ implementation prior to ARROW-2790 [2], masked values in new arrays were not initialized. That is *technically* fine if we never access those values, but if they are zeroed then arithmetic can be performed independent of the null bitmap without causing tools like valgrind to report access of uninitialized values.

I think the addition of this guarantee on masked values would be a useful formalization of performance/correctness decisions made in different ways by multiple implementations. Supporting this should not require overly invasive changes to the C++ implementation; I *think* it will only require specialization of the binary->utf8 cast.

Re Utf8View: thanks for your implementation @Raphael! I'll review it soon. In the interest of completeness, I also plan to write Utf8View as a canonical extension type as described by @Weston to see how that looks.

Sincerely,
Ben Kietzman

[1]: https://github.com/wch/r-source/blob/e8b7fcc/src/main/arithmetic.c#L90-L98 (R_ValueOfNA)
[2]: https://issues.apache.org/jira/browse/ARROW-2790
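P.S. A rough sketch of how a consumer might act on the proposed key, in plain Python. The key name and the "safe"/"zero" values are the ones suggested in this message; the dict standing in for schema metadata and the helper names are purely hypothetical and not part of any existing arrow API:

```python
# Key name taken from the proposal in this message; not an existing arrow API.
GUARANTEE_KEY = "ARROW:masked_value_guarantee"


def masked_value_guarantee(metadata):
    """Return "safe", "zero", or None if no guarantee is certified."""
    value = (metadata or {}).get(GUARANTEE_KEY)
    return value if value in ("safe", "zero") else None


def total(values, validity, metadata):
    """Sum the non-null values of an integer column."""
    if masked_value_guarantee(metadata) == "zero":
        # Masked slots are certified to be zero: ignore the bitmap entirely.
        return sum(values)
    # No guarantee: masked slots may hold anything, so consult the bitmap.
    return sum(v for v, ok in zip(values, validity) if ok)


print(total([1, 0, 3], [True, False, True], {GUARANTEE_KEY: "zero"}))  # 4
print(total([1, 99, 3], [True, False, True], None))                    # 4
```

An absent or unrecognized value falls back to the bitmap-aware path, so producers which never set the key are unaffected.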