On Sat, 10 Jan 2015 18:55:01 -0500 random...@fastmail.us wrote:
> Why would someone want to use the decimal value of the UTF-8 bytes,
> rather than the unicode codepoint?

Because it sadly is specified like this in the tr-document.

> Why are you using decimal for a syntax that _universally_ means octal?

It was an example of extending this "decimal" idea to UTF-8, but I
totally agree with you that octal is the saner way to go.

> UTF-8 is an encoding of Unicode. No-one actually thinks of the character
> as being "C3B6" - it's 00F6, even if it happens to be encoded as C3 B6
> or F6 00 or whatever. Nobody thinks of UTF-8 sequences as a single
> integer unit.

Well, I do, since I wrote the algorithm; what you probably mean, though,
is how they are expressed as input.

> The sensible thing to do would be to extend the syntax with \u00F6 (and
> \U00010000 for non-BMP characters) the way many other languages have
> done it. This also avoids repeating the mistake of variable-length
> escapes - \u is exactly 4 digits, and \U is exactly 8.

Since they are fixed-length, they could be implemented.

> GNU actually has a history of being behind the curve on UTF-8/multibyte
> characters, so it's not a great example of "what POSIX requires". Cut is
> another notable command with the same problem.

No wonder it's behind <.<. They can't even maintain their codebases
properly.

Cheers

FRIGN

-- 
FRIGN <d...@frign.de>