On 2006-08-19 16:54:36, Peter Maas wrote:

> Gerhard Fiedler wrote:
>> Well, ASCII can represent the Unicode numerically -- if that is what the OP
>> wants.
>
> No. ASCII characters range is 0..127 while Unicode characters range is
> at least 0..65535.

Actually, Unicode goes beyond 65535. But right in this sentence, you
represented the number 65535 with ASCII characters, so it doesn't seem to
be impossible.

>> For example, "U+81EC" (all ASCII) is one possible -- not very
>> readable though <g> -- representation of a Hanzi character (see
>> http://www.cojak.org/index.php?function=code_lookup&term=81EC).
>
> U+81EC means a Unicode character which is represented by the number
> 0x81EC.

Exactly. Both versions represented in ASCII right in your message :)

> UTF-8 maps Unicode strings to sequences of bytes in the range 0..255,
> UTF-7 maps Unicode strings to sequences of bytes in the range 0..127.
> You *could* read the latter as ASCII sequences but this is not correct.

Of course not "correct". I guess the only "correct" representation is the
original Chinese character. But the OP doesn't seem to want this... so a
non-"correct" representation is necessary anyway.

> How to do it in Python? Let chinesePhrase be a Unicode string with
> Chinese content. Then
>
> chinesePhrase_7bit = chinesePhrase.encode('utf-7')
>
> will produce a sequences of bytes in the range 0..127 representing
> chinesePhrase and *looking like* a (meaningless) ASCII sequence.

Actually, no. There are quite a few code positions in the range 0..127
that don't "look like" anything (non-printable). And, as you say, this is
rather meaningless.

> chinesePhrase_16bit = chinesePhrase.encode('utf-16be')
>
> will produce a sequence with Unicode numbers packed in a byte
> string in big endian order. This is probably closest to what
> the OP wants.

That's what you think... but it's not really ASCII. If you want this in
ASCII, and readable, I still suggest transforming this sequence of 2-byte
values (for most Chinese characters it will be 2 bytes per character) into
a sequence of something like U+81EC (or 0x81EC if you are a C fan or 81EC
if you can imply the rest)...
that's where we come back to my original suggestion :)

Gerhard
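
P.S. For what it's worth, the "U+81EC" style of representation discussed
above could be produced roughly like this. It is only a minimal sketch,
assuming Python 3 (where str is already a Unicode string); the helper name
to_codepoint_notation and the variable names are made up for illustration
and do not come from this thread.

```python
def to_codepoint_notation(text, prefix="U+"):
    """Return a printable, ASCII-only string such as 'U+81EC' per character.

    Hypothetical helper for illustration. Code points above 0xFFFF simply
    get more than four hex digits.
    """
    return " ".join(f"{prefix}{ord(ch):04X}" for ch in text)


# The Hanzi used as the example in this thread, written as an escape so
# that this source file itself stays pure ASCII.
chinese_phrase = "\u81ec"

ascii_form = to_codepoint_notation(chinese_phrase)
print(ascii_form)

# For comparison, the two encodings quoted above. Both give byte strings,
# not readable text: UTF-16BE contains bytes outside 0..127, and UTF-7
# stays within 0..127 but is not meaningful to a human reader.
utf16_bytes = chinese_phrase.encode("utf-16be")
utf7_bytes = chinese_phrase.encode("utf-7")
print(utf16_bytes, utf7_bytes)
```

Passing prefix="0x" or prefix="" gives the other two spellings mentioned
above. The result is a plain str, so ascii_form.encode("ascii") succeeds
without any error handler.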