> I have a problem with tr, version 5.94:
> I'm using debian with a 100% utf-8 system. It is not an x-term related
> problem (this also occurs in a vt). Quoting the arguments (tr "é" "e")
> does not help.

Thanks for the report. However, upstream coreutils does not yet support
multibyte characters. The TODO file documents the need for a clean patch
that handles multibyte characters without penalizing the speed of strict
single-byte locales. So far, several vendors have provided add-on patches
that attempt this, but none has been considered clean enough to apply
upstream.

> [EMAIL PROTECTED]:~$ echo hello | tr o a # no problem here
> hella

Even in utf-8, all these characters are single bytes.

>
> [EMAIL PROTECTED]:~$ echo hé | tr é e # why do I get 2 'e' ?
> hee

In utf-8, é occupies 2 bytes, but e occupies one, and single-byte
translation is occurring, so this bit from the info pages is relevant:

"On the other hand, making SET1 longer than SET2 is not portable; POSIX
says that the result is undefined. In this situation, BSD `tr' pads SET2
to the length of SET1 by repeating the last character of SET2 as many
times as necessary. System V `tr' truncates SET1 to the length of SET2."

Thus, both utf-8 bytes of é are being translated into the expanded SET2
of ee.

>
> [EMAIL PROTECTED]:~$ echo hé | tr à a # here tr should do nothing...
> ha©
>

Again, é (bytes 0xC3 0xA9) and à (bytes 0xC3 0xA0) are multibyte and share
a common first byte. With single-byte translation, the common byte is
translated to a, and the remaining byte of é is passed through unchanged
but now forms an illegal utf-8 sequence.

-- 
Eric Blake
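
For anyone who wants to reproduce the byte-level behaviour without relying on a particular tr build, here is a minimal Python sketch. It is only a model of the explanation above, not the coreutils implementation: the function name bytewise_tr is made up for illustration, and the padding rule simply mirrors the BSD-style behaviour quoted from the info pages.

```python
def bytewise_tr(text: str, set1: str, set2: str) -> bytes:
    """Model a single-byte tr: translate UTF-8 input one byte at a time.

    Hypothetical helper for illustration only; not taken from coreutils.
    """
    src = set1.encode("utf-8")
    dst = set2.encode("utf-8")
    # Assumption: pad SET2 with its last byte when it is shorter than SET1,
    # as in the BSD-style behaviour quoted from the info pages.
    if len(dst) < len(src):
        dst += dst[-1:] * (len(src) - len(dst))
    table = bytes.maketrans(src, dst[:len(src)])
    return text.encode("utf-8").translate(table)


print(bytewise_tr("hello\n", "o", "a"))  # all single bytes, nothing odd
print(bytewise_tr("hé\n", "é", "e"))     # both bytes of é become e
print(bytewise_tr("hé\n", "à", "a"))     # 0xC3 becomes a, 0xA9 is left over
```

The last call leaves a lone 0xA9 byte in the output, which is not valid utf-8 on its own; a terminal that falls back to Latin-1 would display it as ©.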