On Fri, Jan 9, 2015, at 18:39, FRIGN wrote:
> C3B6 is 'ö' and makes sense to allow specifying it as \50102 (in the pure
> UTF-8-sense of course, nothing to do with collating).

Why would someone want to use the decimal value of the UTF-8 bytes,
rather than the Unicode codepoint?

Why are you using decimal for a syntax that _universally_ means octal?

UTF-8 is an encoding of Unicode. No-one actually thinks of the character
as being "C3B6" - it's 00F6, even if it happens to be encoded as C3 B6
(UTF-8) or F6 00 (UTF-16LE) or whatever. Nobody thinks of UTF-8
sequences as a single integer unit.
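
To make the numbers concrete, here is a small Python sketch (purely
illustrative, the variable names are mine and have nothing to do with
tr or sbase). It shows where 50102 comes from, and what the same digits
would mean under the usual octal reading of a \NNN escape:

```python
# Illustrative only: none of these names come from tr or sbase.
ch = "ö"

codepoint = ord(ch)                    # the Unicode code point
utf8 = ch.encode("utf-8")              # its UTF-8 byte sequence
as_int = int.from_bytes(utf8, "big")   # those bytes read as one big-endian integer

print(f"U+{codepoint:04X}")            # U+00F6
print(utf8.hex(" ").upper())           # C3 B6
print(as_int)                          # 50102

# The same digits read as octal, which is what \NNN conventionally means:
octal_value = int("50102", 8)
print(f"{octal_value:#x}")             # 0x5042
```

So 50102 is just 0xC3B6 written in decimal, while anyone reading
\50102 as octal gets 0x5042, which is a completely different value.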

The sensible thing to do would be to extend the syntax with \u00F6 (and
\U00010000 for non-BMP characters), the way many other languages have
done it. This also avoids repeating the mistake of variable-length
escapes - \u is exactly 4 digits, and \U is exactly 8.
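
A rough sketch of what I mean, in Python rather than C for brevity.
The function name expand_escapes is hypothetical and this is not a
proposal for how sbase should implement it, just a demonstration that
fixed-width escapes need no lookahead rules:

```python
import re

# \u followed by exactly 4 hex digits, or \U followed by exactly 8.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def expand_escapes(s):
    """Replace \\uXXXX and \\UXXXXXXXX in s with the characters they name.

    Hypothetical helper for illustration. chr() raises ValueError for
    values above 0x10FFFF, which this sketch does not handle.
    """
    def repl(m):
        # Exactly one of the two groups matched; parse it as hexadecimal.
        return chr(int(m.group(1) or m.group(2), 16))
    return _ESCAPE.sub(repl, s)

print(expand_escapes(r"\u00F6"))    # ö
print(expand_escapes(r"\u00F61"))   # ö1
```

Because the length is fixed, a digit that follows the escape (the
trailing 1 above) is unambiguously a separate character.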

> Well, probably I misunderstood the matter. Sometimes this stuff gets
> above my head. ;)
> At the end of the day, you want software to work as expected:
> 
> GNU tr:
> $ echo ελληνική | tr [α-ω] [Α-Ω]
> ®®®®®®®®®
> 
> our tr:
> $ echo ελληνικη | ./tr [α-ω] [Α-Ω]
> ΕΛΛΗΝΙΚΗ

And that's fine. I think POSIX actually _requires_ it to work the way
yours does, and GNU fails to comply. As a data point, OSX and FreeBSD
both work the same way as sbase for this test case.
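
For what it's worth, the expected result falls out directly if the
ranges are treated as ranges of code points rather than bytes. A
minimal Python sketch, assuming equal-length ranges (translate_range is
a made-up name, and a real tr has its own rules for mismatched lengths
that I am not reproducing here):

```python
def translate_range(text, src_first, src_last, dst_first, dst_last):
    """Map the code point range src_first..src_last onto dst_first..dst_last.

    Hypothetical helper: it works on whole characters, never on bytes.
    """
    src = range(ord(src_first), ord(src_last) + 1)
    dst = range(ord(dst_first), ord(dst_last) + 1)
    if len(src) != len(dst):
        # Simplification for this sketch only.
        raise ValueError("ranges differ in length")
    # str.translate accepts a dict of code point -> code point.
    return text.translate(dict(zip(src, dst)))

print(translate_range("ελληνικη", "α", "ω", "Α", "Ω"))  # ΕΛΛΗΝΙΚΗ
```

α-ω is U+03B1..U+03C9 and Α-Ω is U+0391..U+03A9, so both ranges
contain 25 code points and line up one to one.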

GNU actually has a history of being behind the curve on UTF-8/multibyte
characters, so it's not a great example of "what POSIX requires".
Another notable command with the same problem is cut.
