At Workbench, we've taken to validating UTF-8 for well-formedness when we call Validate(). (After Arrow 1.0.0, this will be ValidateFull().)
As I understand it, ValidateFull() is optional. I imagine *all* Arrow users calling ValidateFull() on a utf8 column would prefer to bail on ill-formed UTF-8 byte sequences -- unless such a check is too slow.

After I deployed UTF8-checking code in Workbench[1], I've come to the opinion that a well-formedness check is amply fast. Today's CPUs can validate several gigabytes of text per second[2]. Checking UTF-8 for well-formedness is O(m+n) in the array length (m) and number of bytes of text (n).

Would the Arrow team welcome a pull request that enhances ValidateFull() to validate that utf8-column values are well-formed UTF-8 byte sequences?

Another validation we've added to Workbench is in column *names*. In Arrow's IPC layer, `FieldFromFlatbuffer()` validates that column names are not null. But it doesn't validate that column names are well-formed UTF-8. The Flatbuffers spec says strings should be valid UTF-8. Should `FieldFromFlatbuffer()` check?

Enjoy life,
Adam

[1] https://github.com/CJWorkbench/arrow-tools/blob/ddc1a664ac3d0b78f4537e3e8e82ecc10c471ef8/src/arrow-validate.cc#L43
[2] https://github.com/cyb70289/utf8

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com
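
P.S. To make the O(m+n) claim concrete, here is a rough sketch of the kind of check I have in mind. It is not the code at [1] and it uses no Arrow API: `IsWellFormedUtf8()` and `FirstIllFormedValue()` are hypothetical names, and the column is modelled as a plain offsets vector (m+1 entries) plus a data vector (n bytes). It also assumes there are no null values, so it ignores the validity bitmap. It is a simple scalar loop, so I would not expect it to match the throughput of the routines in [2].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if bytes [p, p+len) are a well-formed UTF-8 sequence:
// no overlong encodings, no surrogates (U+D800..U+DFFF), nothing above U+10FFFF.
bool IsWellFormedUtf8(const uint8_t* p, size_t len) {
  assert(p != nullptr || len == 0);
  size_t i = 0;
  while (i < len) {
    const uint8_t b0 = p[i];
    if (b0 < 0x80) {  // ASCII
      i += 1;
      continue;
    }
    size_t n = 0;                  // number of continuation bytes expected
    uint8_t lo = 0x80, hi = 0xBF;  // allowed range for the first continuation byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
      n = 1;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
      n = 2;
      if (b0 == 0xE0) lo = 0xA0;  // rejects overlong 3-byte forms
      if (b0 == 0xED) hi = 0x9F;  // rejects surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
      n = 3;
      if (b0 == 0xF0) lo = 0x90;  // rejects overlong 4-byte forms
      if (b0 == 0xF4) hi = 0x8F;  // rejects code points above U+10FFFF
    } else {
      return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
    }
    if (len - i <= n) return false;  // sequence is truncated
    if (p[i + 1] < lo || p[i + 1] > hi) return false;
    for (size_t k = 2; k <= n; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;
    }
    i += n + 1;
  }
  return true;
}

// Returns the index of the first value that is ill-formed (or whose offsets
// are out of bounds), or -1 if every value is well-formed.
// Visits each of the m values once and each of the n bytes once: O(m+n).
int64_t FirstIllFormedValue(const std::vector<int32_t>& offsets,
                            const std::vector<uint8_t>& data) {
  for (size_t i = 0; i + 1 < offsets.size(); ++i) {
    const int32_t begin = offsets[i];
    const int32_t end = offsets[i + 1];
    if (begin < 0 || end < begin || static_cast<size_t>(end) > data.size()) {
      return static_cast<int64_t>(i);
    }
    if (!IsWellFormedUtf8(data.data() + begin, static_cast<size_t>(end - begin))) {
      return static_cast<int64_t>(i);
    }
  }
  return -1;
}
```

Each value is checked on its own rather than checking the data buffer in one pass, so a multi-byte character split across two adjacent values is rejected even though the concatenated bytes would look fine.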