On Mon, Dec 16, 2019 at 1:41 AM Paul Eggert <egg...@cs.ucla.edu> wrote:
> On 12/15/19 11:40 AM, Roy Smith wrote:
> > With the following input:
> >
> >> $ cat x
> >> "ⁿᵘˡˡ"
> >> "ܥܝܪܐܩ"
> >
> >
> > Running "uniq -c" says there's two copies of the same line!
> >
> >> $ uniq -c x
> >> 2 "ⁿᵘˡˡ"
>
> Thanks for the bug report. I expect this is because GNU 'uniq' uses the
> equivalent of strcoll (locale-dependent comparison) to compare lines, whereas
> macOS 'uniq' uses the equivalent of strcmp (byte comparison). Since the two
> lines compare equal in your locale, GNU 'uniq' says there's just one line.
>
> The GNU 'uniq' behavior appears to be a consequence of this commit:
>
> commit 545c2323d493c7ed9c770d9b8e45a15db6f615bc
> Author: Jim Meyering <j...@meyering.net>
> Date: Fri Aug 2 14:42:37 2002 +0000
>
> with a change noted this way in NEWS:
>
> * uniq now obeys the LC_COLLATE locale, as per POSIX 1003.1-2001 TC1.
>
> However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq',
> and I expect this means that the 2002 commit should be reverted so that GNU
> 'uniq' behaves like macOS 'uniq' (a behavior that I think makes more sense
> anyway).
>
> I'll CC: this email to Jim Meyering to see whether he has an opinion about
> this.
>
> In the meantime you can work around the problem by using 'LC_ALL=C uniq'
> instead of plain 'uniq' in your shell script.

Thanks for the report, Roy, and thanks Paul for diving in. I confess I haven't done more than look at that old diff, but this sure sounds like a bug we must fix, to get in line with the much more recent POSIX spec.
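
For anyone wanting to check the 'LC_ALL=C uniq' workaround mentioned in the quoted message, here is a minimal sketch. It assumes a POSIX shell with mktemp available; the scratch file and the variable names (tmp, count) are made up for illustration and are not from Roy's report.

```shell
# Recreate the two-line input from the report in a scratch file
# (hypothetical temporary path, created with mktemp).
tmp=$(mktemp)
printf '%s\n' '"ⁿᵘˡˡ"' '"ܥܝܪܐܩ"' > "$tmp"

# Workaround: force the C locale for this one command, so lines are
# compared byte by byte rather than with the locale's collation rules.
LC_ALL=C uniq -c "$tmp"

# Count how many distinct adjacent lines uniq reports under the C locale.
count=$(LC_ALL=C uniq "$tmp" | wc -l | tr -d ' ')
echo "lines reported under LC_ALL=C: $count"

rm -f "$tmp"
```

Running plain 'uniq -c' on the same file in a UTF-8 locale may or may not reproduce the merged count, depending on the coreutils version and the collation data installed on the system.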