Randall Lewis wrote:
> "sort" does inconsistent sorting.

You are sure about that? :-)

> I'm pretty sure it has NOTHING to do with the following warning,
> although I could be totally wrong.
>
> " *** WARNING ***
> The locale specified by the environment affects sort order.
> Set LC_ALL=C to get the traditional sort order that uses
> native byte values. "

You read this, know that sort will base the sorting upon the locale
setting, but didn't tell us what locale you were using to sort? Shame
on you. Because you *know* I am going to ask you about it! :-)

What locale are you using? C? en_US.UTF-8? Some other? The locale
command will print this information. Here is an example from my
system.

  $ locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE=C
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

> sort test1.txt
> 323|1
> 36|2
> 40|4
> 406|3
> 587|5
> sort test7.txt
> 323|B1
> 36|C2
> 406|B3
> 40|B4
> 587|C5

Looks okay to me for the en_US.UTF-8 locale. But it will of course be
different in the C locale.

  $ LC_ALL=en_US.UTF-8 sort test1.txt
  323|1
  36|2
  40|4
  406|3
  587|5

  $ LC_ALL=C sort test1.txt
  323|1
  36|2
  406|3
  40|4
  587|5

What ordering did you expect there? I assume you are expecting to see
these sorted as in the C locale?

> The rows are in a different order depending on the dataset--and it
> is NOT a numeric sort. I'm not even sure it is ANY type of sort.

It is a character sort. A string sort. It compares each line
character by character from start to finish. But it uses the system's
collation tables based upon the locale. In the en_US.UTF-8 locale
punctuation is ignored and case is folded, so "40|4" compares like
"404" and "406|3" like "4063", while "406|B3" compares like "406B3"
and "40|B4" like "40B4". That is why the two files come out in a
different order. I don't like it but the powers that be have decreed
it.

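As a quick check of the byte-value ordering, here is a minimal sketch that rebuilds the five lines from your test1.txt output in a temporary file and sorts them in the C locale. The temporary file and the variable names are only for illustration. Setting LC_ALL on the command line affects just that one command, not the rest of your session.

```shell
# Recreate the sample data from the quoted test1.txt output
# (temporary file used purely for illustration).
tmp=$(mktemp)
printf '%s\n' '323|1' '36|2' '40|4' '406|3' '587|5' > "$tmp"

# Force byte-value collation for this one command only.
# In ASCII '6' (0x36) sorts before '|' (0x7C), so 406|3 precedes 40|4.
c_sorted=$(LC_ALL=C sort "$tmp")
printf '%s\n' "$c_sorted"

rm -f "$tmp"
```

Running the same sort with LC_ALL=en_US.UTF-8 should give the other ordering, provided that locale is installed on your system.
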
Please see the FAQ:

  http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

The standards documentation:

  http://www.opengroup.org/onlinepubs/009695399/utilities/sort.html

Variables that control localization:

  http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html#tag_08_02

> sort -k1 -t "|" test1.txt

Hint: If you ever think you need to use -k POS1 then you almost always
should be using -k POS1,POS2 to specify where you want the sort to
stop comparing. Otherwise it compares all of the way to the end of
the line.

> But why did it sort inconsistently in the first place based on the
> other contents of the file rather than just focusing on the first
> column--even when I told it to?

You never told it not to continue comparing all of the way to the end
of the line. To make it stop at the end of each field, give every key
an end position, for example this way:

  $ sort -t'|' -k1,1n -k2,2n test1.txt
  36|2
  40|4
  323|1
  406|3
  587|5

That won't help you with join, though, since join expects a
non-numeric sort ordering.

> Inconsistent sorting when combined with 'join' provides incorrect
> matches and duplication of records. This is a mess.

Yes. Recent versions of join detect and warn about this. Recent
versions of sort have a --debug option that can help to identify
problem cases.

Bob
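
P.S. For the join case, a minimal sketch of one approach: sort both inputs on the first field only, as a plain string key, and run sort and join under the same locale. The file names and contents below are made up for illustration and are not your real data.

```shell
# Hypothetical sample files; names and contents are for illustration only.
dir=$(mktemp -d)
printf '%s\n' '406|3' '40|4' '36|2' > "$dir/left.txt"
printf '%s\n' '40|B4' '36|C2' '406|B3' > "$dir/right.txt"

# Sort on field 1 only (-k1,1) as a string key, not numerically.
# A whole-line C sort would put 406|3 before 40|4, which is not
# field-1 order and would confuse join.
LC_ALL=C sort -t'|' -k1,1 "$dir/left.txt"  > "$dir/left.sorted"
LC_ALL=C sort -t'|' -k1,1 "$dir/right.txt" > "$dir/right.sorted"

# Run join under the same collation that was used to sort its inputs.
joined=$(LC_ALL=C join -t'|' "$dir/left.sorted" "$dir/right.sorted")
printf '%s\n' "$joined"

rm -rf "$dir"
```

The point of the sketch is only that the key given to sort matches the field join compares, and that both commands agree on the locale. Any locale works as long as it is the same one for both.
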