bug#9740: Bug in sort

Eric Blake Wed, 12 Oct 2011 12:03:49 -0700

tag 9740 notabug
thanks

On 10/12/2011 12:41 PM, Lluís Padró wrote:

I found a bug in the "sort" utility that happens under utf8 locales, though
no character beyond basic ascii is involved in it...

Thanks for the report; however, this is almost certainly a case of your locale defining a different collation order than what you were expecting. See the FAQ:

https://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

I'm using "sort (GNU coreutils) 7.4" from package
"coreutils-7.4-2ubuntu3" on ubuntu lucid 10.04.03 LTS

The latest version of coreutils, 8.14, includes a --debug option that makes it even more apparent why sort is behaving correctly:

## Let's try another locale
~$ export LC_ALL="en_US.UTF-8"

## Sort fails. Shorter words are sorted after longer words with the same prefix.
~$ sort testfile
abcd Z
abce Z
abc Z
ab Z

$ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug
sort: using `en_US.UTF-8' sorting rules
abcd Z
______
abce Z
______
abc Z
_____
ab Z
____

So, what exactly is sort comparing? The entire line (because you didn't specify any -k options to limit it to fields). And how does it do the comparison? By strcoll("abcd Z", "abc Z"). And how does strcoll() behave in the en_US.UTF-8 locale? By dictionary collation - that is, case and punctuation (including space) are ignored. So you get the same answer for both strcoll("abcd Z", "abc Z") and for strcoll("abcdz", "abcz") in that locale, and sure enough, d comes before z, so the sort is correct.
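
If the en_US locale is not handy, the effect described above can be imitated in a rough way. The sketch below is only an approximation and not what strcoll() literally does (real collation uses several levels of weights): it assumes that folding case and deleting spaces gives a good enough stand-in for the locale's first comparison pass, builds such a key with awk, sorts on that key by byte value, and then throws the key away. The variable names (tab, approx) are mine, purely for illustration.

```shell
# Approximation only (an assumption, not the real strcoll() algorithm):
# fold case and drop spaces to build a key, sort on the key by byte value,
# then strip the key so only the original lines remain.
tab=$(printf '\t')
approx=$(printf 'abc Z\nab Z\nabcd Z\nabce Z\n' |
  awk '{ key = tolower($0); gsub(/ /, "", key); print key "\t" $0 }' |
  LC_ALL=C sort -t "$tab" -k1,1 |
  cut -f2-)
printf '%s\n' "$approx"
```

The keys come out as abcz, abz, abcdz and abcez, so the lines should appear in the same order as in the --debug output above, with "abcd Z" first and "ab Z" last.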

You already figured out that LC_ALL=C forces sorting to honor byte values. But if you insist on using en_US collation, then maybe you should also look at forcing the sort to honor specific fields:

$ printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | sort --debug -sb -k1,1 -k2,2
sort: using `en_US.UTF-8' sorting rules
ab Z
__
   _
abc Z
___
    _
abcd Z
____
     _
abce Z
____
     _
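
For comparison, the byte-value ordering mentioned earlier can be checked without touching the rest of the environment, by setting LC_ALL=C for the sort command alone. This is a minimal sketch using the same four sample lines; the variable name bytewise is only illustrative.

```shell
# Override the locale for this one sort invocation; an exported LC_ALL in
# the calling shell is left untouched.
bytewise=$(printf 'abc Z\nab Z\nabcd Z\nabce Z\n' | LC_ALL=C sort)
printf '%s\n' "$bytewise"
```

Here space (byte 0x20) compares lower than any letter, so "ab Z" should come out first and "abce Z" last, which happens to match the field-based result above for this input.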


--
Eric Blake   ebl...@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
