At Workbench, we've taken to validating UTF-8 for well-formedness when we
call Validate(). (After Arrow 1.0.0, this will be ValidateFull().)

As I understand it, ValidateFull() is optional. I imagine *all* Arrow users
calling ValidateFull() on a utf8 column would prefer to bail on ill-formed
UTF-8 byte sequences -- unless such a check is too slow.

Since deploying UTF-8-checking code in Workbench[1], I've come to the
opinion that a well-formedness check is amply fast. Today's CPUs can
validate several gigabytes of text per second[2].

Checking UTF-8 for well-formedness takes a single linear pass: O(m+n) in
the array length (m) and the number of bytes of text (n).
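
To make the cost concrete, here is a minimal scalar sketch of such a check.
It is not Arrow's API and not the Workbench code in [1]; the names
`IsWellFormedUtf8` and `AllValuesWellFormed` are hypothetical. It assumes
the offsets are already known to be monotonic and in bounds, and it ignores
the null bitmap. The SIMD approach in [2] should be considerably faster.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not an Arrow function). Returns true if
// data[0..len) is well-formed UTF-8: no overlong forms, no surrogates,
// nothing above U+10FFFF, no truncated sequences.
bool IsWellFormedUtf8(const uint8_t* data, size_t len) {
  size_t i = 0;
  while (i < len) {
    const uint8_t b0 = data[i];
    if (b0 <= 0x7F) {  // ASCII
      ++i;
      continue;
    }
    size_t n_cont = 0;             // continuation bytes expected
    uint8_t lo = 0x80, hi = 0xBF;  // allowed range of the first one
    if (b0 >= 0xC2 && b0 <= 0xDF) {
      n_cont = 1;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
      n_cont = 2;
      if (b0 == 0xE0) lo = 0xA0;  // reject overlong 3-byte forms
      if (b0 == 0xED) hi = 0x9F;  // reject surrogates U+D800..U+DFFF
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
      n_cont = 3;
      if (b0 == 0xF0) lo = 0x90;  // reject overlong 4-byte forms
      if (b0 == 0xF4) hi = 0x8F;  // reject code points above U+10FFFF
    } else {
      return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
    }
    if (len - i <= n_cont) return false;  // truncated sequence
    if (data[i + 1] < lo || data[i + 1] > hi) return false;
    for (size_t k = 2; k <= n_cont; ++k) {
      if ((data[i + k] & 0xC0) != 0x80) return false;
    }
    i += n_cont + 1;
  }
  return true;
}

// Convenience overload for std::string.
bool IsWellFormedUtf8(const std::string& s) {
  return IsWellFormedUtf8(reinterpret_cast<const uint8_t*>(s.data()),
                          s.size());
}

// Hypothetical per-array check over a utf8-style layout: m + 1 offsets
// into one data buffer of n bytes. Each value is checked on its own, so
// the loop runs m times and touches each of the n bytes once.
bool AllValuesWellFormed(const std::vector<int32_t>& offsets,
                         const std::vector<uint8_t>& data) {
  for (size_t v = 0; v + 1 < offsets.size(); ++v) {
    const int32_t begin = offsets[v];
    const int32_t end = offsets[v + 1];
    if (!IsWellFormedUtf8(data.data() + begin,
                          static_cast<size_t>(end - begin))) {
      return false;
    }
  }
  return true;
}
```

Values are checked one at a time rather than as one pass over the whole
data buffer, because a multi-byte sequence split across two adjacent values
would look fine in the buffer while leaving both values ill-formed.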

Would the Arrow team welcome a pull request that enhances ValidateFull() to
validate that utf8-column values are well-formed UTF-8 byte sequences?

Another validation we've added to Workbench is in column *names*. In
Arrow's IPC layer, `FieldFromFlatbuffer()` validates that column names are
not null. But it doesn't validate that column names are well-formed UTF-8.
The Flatbuffers spec says strings should be valid UTF-8. Should
`FieldFromFlatbuffer()` check?

Enjoy life,
Adam

[1] https://github.com/CJWorkbench/arrow-tools/blob/ddc1a664ac3d0b78f4537e3e8e82ecc10c471ef8/src/arrow-validate.cc#L43
[2] https://github.com/cyb70289/utf8

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com
