bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-15 Thread Roy Smith
With the following input:

> $ cat x
> "ⁿᵘˡˡ"
> "ܥܝܪܐܩ"

Running "uniq -c" says there's two copies of the same line!

> $ uniq -c x
> 2 "ⁿᵘˡˡ"

I've attached a copy of the test file, and here's the octal dump:

> $ od -b x
> 000 042 342 201 277 341 265 230 313 241 313 241 042 012 042 33

bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-16 Thread Roy Smith
Yup, this does depend on the locale. In my original example, I had LANG=en_US.UTF-8. Setting it to C.UTF-8 gets me the right result:

> $ LANG=C.UTF-8 uniq -c x
> 1 "ⁿᵘˡˡ"
> 1 "ܥܝܪܐܩ"

But, that doesn't fully explain what's going on. I find it difficult to believe that there's any

bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-17 Thread Roy Smith
, iraq);
printf("m = %d\n", m);
}

That correctly says the strings are different:

$ LANG=en_US.UTF-8 ./a.out
ⁿᵘˡˡ ܥܝܪܐܩ
m = 6

> On Dec 16, 2019, at 7:46 PM, Roy Smith wrote:
>
> Yup, this does depend on the locale. In my original example, I had
> LANG=en_US.UTF-8