On Fri, Jan 9, 2015, at 18:39, FRIGN wrote:
> C3B6 is 'ö' and makes sense to allow specifying it as \50102 (in the pure
> UTF-8-sense of course, nothing to do with collating).

Why would someone want to use the decimal value of the UTF-8 bytes,
rather than the Unicode codepoint? And why use decimal for a syntax that
_universally_ means octal?

UTF-8 is an encoding of Unicode. No-one actually thinks of the character
as being "C3B6" - it's 00F6, even if it happens to be encoded as C3 B6
or F6 00 or whatever. Nobody thinks of UTF-8 sequences as a single
integer unit.

The sensible thing to do would be to extend the syntax with \u00F6 (and
\U00010000 for non-BMP characters), the way many other languages have
done it. This also avoids repeating the mistake of variable-length
escapes - \u is exactly 4 digits, and \U is exactly 8.

> Well, probably I misunderstood the matter. Sometimes this stuff gets
> above my head. ;)
> At the end of the day, you want software to work as expected:
>
> GNU tr:
> $ echo ελληνική | tr [α-ω] [Α-Ω]
> ®®®®®®®®®
>
> our tr:
> $ echo ελληνικη | ./tr [α-ω] [Α-Ω]
> ΕΛΛΗΝΙΚΗ

And that's fine. I think POSIX actually _requires_ it to work the way
yours does, and GNU fails to comply. As a data point, OSX and FreeBSD
both work the same way as sbase for this test case.

GNU has a history of being behind the curve on UTF-8/multibyte
characters, so it's not a great example of "what POSIX requires". Cut is
another notable command with the same problem.
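
To make the \u / \U suggestion concrete, here is a rough Python sketch of
fixed-width escape expansion. It is only an illustration, not sbase code:
the helper name expand_escapes and the regular expression are made up for
this example. It also shows that C3 B6 and F6 00 are just two encodings
of the same codepoint 00F6, and that 50102 is nothing more than the bytes
C3B6 read as one integer.

```python
import re

# Hypothetical helper (not part of sbase or any real tr):
# \u takes exactly 4 hex digits, \U takes exactly 8.
_ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

def expand_escapes(s):
    """Replace \\uXXXX and \\UXXXXXXXX with the codepoint they name."""
    def repl(m):
        # Exactly one of the two groups matched; parse it as hex.
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(m.group(1) or m.group(2), 16))
    return _ESCAPE.sub(repl, s)

o_umlaut = expand_escapes(r'\u00F6')

# One codepoint, several encodings.
print(hex(ord(o_umlaut)))                   # 0xf6
print(o_umlaut.encode('utf-8').hex())       # c3b6
print(o_umlaut.encode('utf-16-le').hex())   # f600

# The UTF-8 bytes read as a single integer give the decimal form.
print(int('c3b6', 16))                      # 50102

# Fixed width means a trailing digit is never swallowed by the escape.
print(len(expand_escapes(r'\u00F61')))      # 2
```

Because the digit count is fixed, an escape followed directly by a digit
stays unambiguous, which is exactly the problem variable-length escapes
have.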