[Bug 247494] sort(1) order affected by LC_CTYPE

bugzilla-noreply Tue, 23 Jun 2020 07:14:24 -0700

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=247494


--- Comment #1 from Conrad Meyer <c...@freebsd.org> ---
On CURRENT:

$ LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C locale
LANG=C
LC_CTYPE=ja_JP.UTF-8
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

sort(1) tries to detect when it can run in its fast, byte-compare-only mode
by looking at LC_COLLATE alone.  The --debug option shows more
information:

$ (echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --debug
Memory to be used for sorting: 17100230656
Using collate rules of C locale
Byte sort is used
sort_method=radixsort
; offset=1
; k1=<耳>(1), k2=<脳>(1); offset=1; s1=<耳>, s2=<脳>; cmp1=0
; offset=1
; k1=<脳>(1), k2=<耳>(1); offset=1; s1=<脳>, s2=<耳>; cmp1=0
耳
脳
耳

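Both comparisons look wrong: cmp1=0 reports two different keys as equal.  The
UTF-8 sequences (e8 80 b3 for 耳, e8 84 b3 for 脳) share the first byte, 0xe8,
but differ in the second.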
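
The byte sequences can be checked with od(1).  This is only a minimal sketch,
assuming a POSIX shell and that the text is stored as UTF-8; the helper name
utf8_hex is made up for illustration and is not part of sort(1):

```shell
# utf8_hex is a hypothetical helper: print the bytes of its argument in hex.
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

utf8_hex 耳; echo    # e880b3
utf8_hex 脳; echo    # e884b3
```

The two strings differ at the second byte (0x80 vs 0x84), so a byte-wise
compare of the full keys should not report them as equal.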

In LC_CTYPE=C mode:

; offset=1
; k1=<耳>(3), k2=<脳>(3); offset=1; s1=<耳>, s2=<脳>; cmp1=-4
; offset=1
; k1=<脳>(3), k2=<耳>(3); offset=1; s1=<脳>, s2=<耳>; cmp1=4
; offset=1
; k1=<耳>(3), k2=<耳>(3); offset=1; s1=<耳>, s2=<耳>; cmp1=0
耳
耳
脳

Here the comparisons look correct.  I will look into this a little more.  I
think this is a bug, not design, but I am not sure yet.
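
The cmp1 values of -4 and 4 are consistent with a plain difference of the
first differing bytes of the UTF-8 sequences e8 80 b3 and e8 84 b3.  A small
sketch to check the arithmetic, assuming a POSIX shell with od(1) and awk(1);
second_byte is a hypothetical helper, and this does not claim to mirror how
sort(1) computes the value internally:

```shell
# second_byte is a hypothetical helper: decimal value of the 2nd byte of $1.
second_byte() {
    printf '%s' "$1" | od -An -tu1 | awk '{ print $2 }'
}

a=$(second_byte 耳)    # 128 (0x80)
b=$(second_byte 脳)    # 132 (0x84)
echo $((a - b))        # -4
echo $((b - a))        # 4
```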
