https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=247494

--- Comment #2 from Conrad Meyer <c...@freebsd.org> ---
I think the lengths printed in the bad example are correct; that is a measure
of wchar_t's, whereas in LC_CTYPE=C, the length is in bytes.  So it seems like
it is a comparison problem.
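
To illustrate the difference between the two length measures, here is a minimal sketch, assuming the input is UTF-8. The helper name utf8_codepoints() is hypothetical and is not part of sort(1); it counts code points by hand instead of calling mbstowcs(), so it does not depend on a ja_JP.UTF-8 locale being installed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: count UTF-8 code points by skipping
 * continuation bytes (those of the form 10xxxxxx).
 * Assumes s is valid, NUL-terminated UTF-8.
 */
size_t
utf8_codepoints(const char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++) {
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	}
	return (n);
}
```

For the UTF-8 encoding of "a耳" (bytes 61 E8 80 B3), strlen() reports 4 bytes while utf8_codepoints() reports 2, which is the (2) shown for that key in the debug output.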

I think we invoke wstrcoll() -> bwscoll() in the latter case.  bwscoll() seems
to be broken for short strings:

        if (len1 <= offset)
                return ((len2 <= offset) ? 0 : -1);

E.g.:

$ (echo a耳 ; echo a脳 ; echo a耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --debug
...
; offset=1
; k1=<a耳>(2), k2=<a脳>(2); offset=1; s1=<a耳>, s2=<a脳>; cmp1=-256
; offset=1
; k1=<a脳>(2), k2=<a耳>(2); offset=1; s1=<a脳>, s2=<a耳>; cmp1=256
; offset=1
; k1=<a耳>(2), k2=<a耳>(2); offset=1; s1=<a耳>, s2=<a耳>; cmp1=0
a耳
a耳
a脳

The result is correct here, because length (2) > offset (1), so the early
return quoted above is not taken.  I don't know if 'offset' here is wrong, or
if bwscoll() is wrong.  It seems like maybe it only invokes bwscoll() on
strings it thinks are identical from a radix perspective.  So perhaps the
problem is some combination of wcstr and byte_sort in radixsort.
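
The effect of that early return can be sketched as follows. This is a simplified, hypothetical stand-in, not the bwscoll() source: only the first 'if' is taken from the snippet quoted earlier, and the function name sketch_coll(), its signature, and the rest of the body are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical offset-aware comparison.  Only the first check mirrors
 * the quoted snippet; everything else is an assumed, simplified model.
 */
int
sketch_coll(const wchar_t *s1, size_t len1, const wchar_t *s2, size_t len2,
    size_t offset)
{
	size_t i;

	assert(s1 != NULL && s2 != NULL);

	/* Quoted early return: strings no longer than 'offset'. */
	if (len1 <= offset)
		return ((len2 <= offset) ? 0 : -1);
	if (len2 <= offset)
		return (1);

	/* Assumed: compare the remaining wide characters by value. */
	for (i = offset; i < len1 && i < len2; i++) {
		if (s1[i] != s2[i])
			return ((int)s1[i] - (int)s2[i]);
	}
	if (len1 == len2)
		return (0);
	return ((len1 < len2) ? -1 : 1);
}
```

In this model, with offset=1 the single-character keys 耳 (0x8033) and 脳 (0x8133) compare as 0 even though they differ, while a耳 and a脳 compare as -256, the same cmp1 value as in the debug output. With offset=0 the single-character keys also compare as -256.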

In --mergesort mode, the result and comparisons are correct:

(echo 耳 ; echo 脳 ; echo 耳) | LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --mergesort --debug
Memory to be used for sorting: 17100230656
Using collate rules of C locale
Byte sort is used
sort_method=mergesort
; k1=<耳>(1), k2=<脳>(1); s1=<耳>, s2=<脳>; cmp1=-256
; k1=<脳>(1), k2=<耳>(1); s1=<脳>, s2=<耳>; cmp1=256
; k1=<耳>(1), k2=<耳>(1); s1=<耳>, s2=<耳>; cmp1=0
耳
耳
脳

Something is broken in radixsort.
