If that is the intended behavior, the bug is that:
> printf '12,\n1,\n' | sort -t, -k1 -s
1,
12,
does _not_ take the remainder of the line into account, and only sorts on the
initial field, prioritizing length.

It is at the very least unexpected that adding an `a` to the end of both lines
would change the sort order of those lines:
> printf '12,a\n1,a\n' | sort -t, -k1 -s
12,a
1,a

On Sun, Jul 12, 2020 at 11:58 PM Assaf Gordon <assafgor...@gmail.com> wrote:

> tags 42340 notabug
> close 42340
> stop
>
> Hello,
>
> On 2020-07-12 5:57 p.m., Beth Andres-Beck wrote:
> > In trying to use `join` with `sort` I discovered odd behavior: even after
> > running a file through `sort` using the same delimiter, `join` would still
> > complain that it was out of order.
> [...]
> > Here is a way to reproduce the problem:
> >
> >> printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | sort -t, > a.txt
> >> printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | sort -t, > b.txt
> >> join -t, a.txt b.txt
> > join: b.txt:2: is not sorted: 1.1.1,b
> >
> > The expected behavior would be that if a file has been sorted by "sort" it
> > will also be considered sorted by join.
> [...]
> > I traced this back to what I believe to be a bug in sort.c
>
> This is not a bug in sort or join, just a side-effect of the locale on
> your system on the sorting results.
>
> By forcing a C locale with "LC_ALL=C" (meaning simple ASCII order),
> the files are ordered in the same way 'join' expected them to be:
>
> $ printf '1.1.1,2\n1.1.12,2\n1.1.2,1' | LC_ALL=C sort -t, > a.txt
> $ printf '1.1.12,a\n1.1.1,b\n1.1.21,c' | LC_ALL=C sort -t, > b.txt
> $ join -t, a.txt b.txt
> 1.1.1,2,b
> 1.1.12,2,a
>
> ---
>
> More details:
> I'm going to assume your system uses some locale based on UTF-8.
> You can check it by running 'locale', e.g. on my system:
> $ locale
> LANG=en_CA.utf8
> LANGUAGE=en_CA:en
> LC_CTYPE="en_CA.utf8"
> ..
> ..
>
> Under most UTF-8 locales, punctuation characters are *ignored* in the
> compared input lines.
> This might be confusing and non-intuitive, but
> that's the way most systems have been working for many years (locale
> ordering is defined in the GNU C Library, and coreutils has no way to
> change it).
>
> Observe the following:
>
> $ printf '12,a\n1,b\n' | LC_ALL=en_CA.utf8 sort
> 12,a
> 1,b
>
> $ printf '12,a\n1,b\n' | LC_ALL=C sort
> 1,b
> 12,a
>
> With a UTF-8 locale, the comma character is ignored, and then "12a"
> appears before "1b" (since the character '2' comes before the character
> 'b').
>
> With "C" locale, forcing ASCII or "byte comparison", punctuation
> characters are not ignored, and "1,b" appears before "12,a" (because
> the comma ',' ASCII value is 44, which is smaller than the ASCII value
> of the digit '2').
>
> ---
>
> Somewhat related:
> Your sort command defines the delimiter ("-t,") but does not define
> which columns to sort by; sort then uses the entire input line - and
> there's no need to specify delimiter at all.
>
> ---
>
> As such, I'm closing this as "not a bug", but discussion can continue by
> replying to this thread.
>
> regards,
> - assaf
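
For anyone who wants to check the locale explanation quoted above, here is a
minimal sketch. It assumes a POSIX shell with `sort`, `join` and `mktemp`
available; the temporary directory and the variable names `whole_line` and
`joined` are illustrative and not from the thread. It uses `-k1,1`, which
limits the key to the first field (a bare `-k1` key runs from field 1 to the
end of the line), and it runs `join` under the same `LC_ALL=C` setting used
for sorting so both tools agree on the collation order.

```shell
#!/bin/sh
# Sketch only: the temp directory and variable names are illustrative.
tmp=$(mktemp -d)

# Whole-line comparison in the C locale: ',' (44) sorts before '2' (50),
# so "1,b" comes before "12,a".
whole_line=$(printf '12,a\n1,b\n' | LC_ALL=C sort)

# Sort on field 1 only (-k1,1), using byte order, for both inputs.
printf '1.1.1,2\n1.1.12,2\n1.1.2,1\n' | LC_ALL=C sort -t, -k1,1 > "$tmp/a.txt"
printf '1.1.12,a\n1.1.1,b\n1.1.21,c\n' | LC_ALL=C sort -t, -k1,1 > "$tmp/b.txt"

# Join under the same locale that was used for sorting.
joined=$(LC_ALL=C join -t, "$tmp/a.txt" "$tmp/b.txt")

printf '%s\n' "$whole_line" "$joined"
rm -rf "$tmp"
```

Keeping `LC_ALL=C` on every command in the pipeline is the design choice here:
the order `join` checks for depends on the locale it runs in, so sorting and
joining under different locales can reproduce the "is not sorted" complaint.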