Hello all,

Utf8View was recently accepted [1] and I've opened a PR to add the spec/schema changes [2]. In review [3], it was requested that signed 32-bit integers be used for the fields of view structs instead of unsigned 32-bit integers.
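For concreteness, here is a rough sketch of the out-of-line ("long string") variant of the 16-byte view struct as I understand it; the field names are mine, not taken from the spec PR, and the question is whether the integer fields below should be int32_t or uint32_t:

    #include <cstdint>

    // Illustrative only: the long-string form of the 16-byte view struct.
    struct StringView {
      int32_t length;        // size of the string in bytes
      char    prefix[4];     // first 4 bytes of the string, for fast comparison
      int32_t buffer_index;  // which data buffer holds the string's bytes
      int32_t offset;        // byte offset of the string within that buffer
    };
    static_assert(sizeof(StringView) == 16, "views are fixed 16-byte structs");

    // With a signed 32-bit offset, a single data buffer can span at most
    // INT32_MAX (~2.1GB) addressable bytes; with uint32_t the limit is ~4.3GB.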
This divergence has been discussed on the ML previously [4], but in light of my reviewer's request for a change it should be raised again for focused discussion. (At this stage, I don't *think* the change would require another vote.)

I'll enumerate the motivations for signed and unsigned as I understand them.

Signed:
- signed integers are conventional in the Arrow format
- unsigned integers may cause some difficulty of implementation in languages which don't natively support them

Unsigned:
- unsigned integers are used by engines which already implement Utf8View

My own bias is toward compatibility with existing implementers, but using signed integers would only affect arrays whose data buffers are larger than 2GB. For reference, the default buffer size in velox is 32KB [5], so such a massive data buffer would only occur when a single slot of a string array holds 2.1GB of characters. This seems sufficiently unlikely that I wouldn't consider it a blocker.

Sincerely,
Ben Kietzman

[1] https://lists.apache.org/thread/wt9j3q7qd59cz44kyh1zkts8s6wo1dn6
[2] https://github.com/apache/arrow/pull/37526
[3] https://github.com/apache/arrow/pull/37526#discussion_r1323029022
[4] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
[5] https://github.com/facebookincubator/velox/blob/947d98c99a7cf05bfa4e409b1542abc89a28cb29/velox/vector/FlatVector.h#L46-L50