Hi,

As discussed on the mailing list [1], it has been proposed to allow
the use of unsigned dictionary indices (which is already technically
possible in our metadata serialization, but not allowed according to
the language of the columnar specification), with the following
caveats:

* Unless part of an application's requirements (e.g. if it is
necessary to store dictionaries with size 128 to 255 more compactly),
implementations are recommended to prefer signed over unsigned
integers, with int32 continuing to be the "default" when the indexType
field of DictionaryEncoding is null.
* uint64 dictionary indices, while permitted, are strongly discouraged
unless required by an application, as they are more difficult to work
with in some programming languages (e.g. Java) and they do not offer
the storage size benefits that uint8 and uint16 do.
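
To illustrate the two caveats above, here is a minimal sketch of how
an application might pick an index type from the dictionary size. The
function name pick_index_type and the allow_unsigned flag are
hypothetical and not part of any Arrow implementation. The sketch
prefers signed types, uses an unsigned type only when it is strictly
narrower than the signed alternative, and never chooses uint64.

```python
# Hypothetical helper for illustration only; not an Arrow API.
SIGNED = [("int8", 8), ("int16", 16), ("int32", 32), ("int64", 64)]
# uint64 is left out on purpose: it is never narrower than int64.
UNSIGNED = [("uint8", 8), ("uint16", 16), ("uint32", 32)]


def pick_index_type(dictionary_size, allow_unsigned=False):
    """Return the name of the narrowest suitable index type."""
    if dictionary_size < 0:
        raise ValueError("dictionary_size must be non-negative")
    # Valid indices run from 0 to dictionary_size - 1.
    max_index = max(dictionary_size - 1, 0)

    signed = next(
        ((name, bits) for name, bits in SIGNED
         if max_index <= 2 ** (bits - 1) - 1),
        None,
    )
    if signed is None:
        raise ValueError("dictionary too large for a signed 64-bit index")
    signed_name, signed_bits = signed

    if allow_unsigned:
        for name, bits in UNSIGNED:
            # Only switch to unsigned when it actually saves space.
            if max_index <= 2 ** bits - 1 and bits < signed_bits:
                return name
    return signed_name
```

For a dictionary with 200 entries, this returns "int16" by default and
"uint8" when allow_unsigned=True; for 100 entries it returns "int8"
either way, since unsigned offers no saving there.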

This change is backwards compatible, but not forward compatible for
all implementations (for example, C++ will reject unsigned integers).
Assuming that the V5 MetadataVersion change is accepted, implementations
that support unsigned dictionary indices would be recommended not to
serialize them using V4 MetadataVersion, to protect against forward
compatibility issues.
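
As a rough sketch of the kind of writer-side guard this implies (the
function name, the string type names, and the use of a plain integer
for the metadata version are all assumptions made for illustration,
not the API of any Arrow implementation):

```python
# Hypothetical writer-side check for illustration only; not an Arrow API.
UNSIGNED_INDEX_TYPES = {"uint8", "uint16", "uint32", "uint64"}


def check_index_type_for_write(index_type, metadata_version):
    """Refuse to write unsigned dictionary indices below V5.

    metadata_version is assumed to be a plain integer (4 for V4,
    5 for V5).
    """
    if index_type in UNSIGNED_INDEX_TYPES and metadata_version < 5:
        raise ValueError(
            "unsigned dictionary index type %s requires MetadataVersion "
            "V5 or later; older readers may reject it" % index_type
        )


check_index_type_for_write("int32", 4)  # accepted: signed index
check_index_type_for_write("uint8", 5)  # accepted: V5 writer
```

Calling check_index_type_for_write("uint8", 4) raises ValueError, so
data that older readers might reject is never written with V4.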

A PR with the changes to the columnar specification (possibly subject
to some clarifying language) is at [2].

The vote will be open for at least 72 hours.

[ ] +1 Accept changes to allow unsigned integer dictionary indices
[ ] +0
[ ] -1 Do not accept because...

[1]: https://lists.apache.org/thread.html/r746e0a76c4737a2cf48dec656103677169bebb303240e62ae1c66d35%40%3Cdev.arrow.apache.org%3E
[2]: https://github.com/apache/arrow/pull/7567
