bug#38627: uniq -c gets wrong count with non-ascii strings

2020-02-23 Thread Paul Eggert
On 2/23/20 11:43 AM, Pádraig Brady wrote:
>  #include "hard-locale.h"
>  #include "posixver.h"
>  #include "stdio--.h"
> -#include "xmemcoll.h"

Please also remove the '#include "hard-locale.h"' line. Thanks for fixing this.

bug#38627: uniq -c gets wrong count with non-ascii strings

2020-02-23 Thread Andreas Schwab
On Feb 23 2020, Pádraig Brady wrote:
> On 17/12/2019 17:25, Roy Smith wrote:
>> I stopped short of actually building uniq.c from source (bootstrap,
>> prerequisites, ...), but looking at the code, it looks like the call chain
>> is:
>>
>> different()
>> xmemcoll()
>> memcoll()
>> strcoll()
>>
>>

bug#38627: uniq -c gets wrong count with non-ascii strings

2020-02-23 Thread Pádraig Brady
On 17/12/2019 17:25, Roy Smith wrote:
> I stopped short of actually building uniq.c from source (bootstrap, prerequisites, ...), but looking at the code, it looks like the call chain is:
>
> different()
> xmemcoll()
> memcoll()
> strcoll()
>
> so I tried a little test at the strcoll() level:
>
> #include #inclu

bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-17 Thread Bruno Haible
> However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq'

Indeed. The change was done in . Quote:

"On Page: 3309 Line: 111067 Section: uniq In the ENVIRONMENT VARIABLES section, delete: LC_COLLATE Determine the locale for ord

bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-17 Thread Jim Meyering
On Mon, Dec 16, 2019 at 1:41 AM Paul Eggert wrote:
> On 12/15/19 11:40 AM, Roy Smith wrote:
> > With the following input:
> >
> >> $ cat x
> >> "ⁿᵘˡˡ"
> >> "ܥܝܪܐܩ"
> >
> >
> > Running "uniq -c" says there's two copies of the same line!
> >
> >> $ uniq -c x
> >> 2 "ⁿᵘˡˡ"
>
> Thanks for the bu

bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-17 Thread Roy Smith
I stopped short of actually building uniq.c from source (bootstrap, prerequisites, ...), but looking at the code, it looks like the call chain is:

different()
xmemcoll()
memcoll()
strcoll()

so I tried a little test at the strcoll() level:

#include #include #include int main (int argc, char

bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-16 Thread Roy Smith
Yup, this does depend on the locale. In my original example, I had LANG=en_US.UTF-8. Setting it to C.UTF-8 gets me the right result:

> $ LANG=C.UTF-8 uniq -c x
> 1 "ⁿᵘˡˡ"
> 1 "ܥܝܪܐܩ"

But, that doesn't fully explain what's going on. I find it difficult to believe that there's any

bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-16 Thread Paul Eggert
On 12/15/19 11:40 AM, Roy Smith wrote:
> With the following input:
>
>> $ cat x
>> "ⁿᵘˡˡ"
>> "ܥܝܪܐܩ"
>
>
> Running "uniq -c" says there's two copies of the same line!
>
>> $ uniq -c x
>> 2 "ⁿᵘˡˡ"

Thanks for the bug report. I expect this is because GNU 'uniq' uses the equivalent of strcol

bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-15 Thread Roy Smith
With the following input:

> $ cat x
> "ⁿᵘˡˡ"
> "ܥܝܪܐܩ"

Running "uniq -c" says there's two copies of the same line!

> $ uniq -c x
> 2 "ⁿᵘˡˡ"

I've attached a copy of the test file, and here's the octal dump:

> $ od -b x
> 000 042 342 201 277 341 265 230 313 241 313 241 042 012 042 33