Hello Benson,

On 11/10/2021 at 19:56, Benson Muite wrote:
> When comparing strings using C++, the default behavior is to order by UTF8 codepoints which impacts comparing strings such as a < b < c [1][2]. This may not be appropriate in all cases and like in the sort function [3], it may be helpful to have an optional field for comparison keys.
It's certainly not appropriate except in the most rudimentary use cases (for example, if keys are ASCII-only). Ideally we should implement the official Unicode Collation Algorithm, but that is a non-trivial endeavour. See the already open issue at https://issues.apache.org/jira/browse/ARROW-12046
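
To make the problem concrete, here is a minimal standalone sketch in plain C++ (it does not use any Arrow API). The names SortByCodepoint, AsciiCaseFoldKey and SortByKey are hypothetical and exist only for this illustration. The key function lowercases ASCII letters and nothing else, so it is emphatically not the Unicode Collation Algorithm; it only shows how an optional comparison key changes the result.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Default ordering: std::string's operator< compares bytes as unsigned
// values, which for valid UTF-8 is the same as code point order.
std::vector<std::string> SortByCodepoint(std::vector<std::string> values) {
  std::sort(values.begin(), values.end());
  return values;
}

// Hypothetical comparison key (an assumption for this sketch): lowercase
// ASCII letters only. Non-ASCII bytes are left untouched.
std::string AsciiCaseFoldKey(const std::string& s) {
  std::string key = s;
  for (char& c : key) {
    if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
  }
  return key;
}

// Sort the values by comparing key(value) instead of the value itself.
template <typename KeyFn>
std::vector<std::string> SortByKey(std::vector<std::string> values, KeyFn key) {
  std::stable_sort(values.begin(), values.end(),
                   [&](const std::string& a, const std::string& b) {
                     return key(a) < key(b);
                   });
  return values;
}
```

With the input "apple", "Zebra", "éclair" (UTF-8 encoded), SortByCodepoint puts "Zebra" first because 'Z' (0x5A) sorts before 'a' (0x61). SortByKey with AsciiCaseFoldKey puts "apple" first, but "éclair" still ends up after "Zebra" because the first byte of 'é' (0xC3) is greater than any ASCII letter. A proper collation would place it between the two, which is why a simple key is not enough beyond ASCII.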

Regards

Antoine.