On Thu, 27 Oct 2011 20:05:13 -0700, Fletcher Johnson wrote:

> If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd' how does
> this creation process interpret the bytes in the byte string?

It doesn't, because there is no byte-string. You have created a Unicode
object from a literal string of unicode characters, not bytes. Those
characters are:

Dec Hex  Char
130 0x82
177 0xb1 ±
130 0x82
234 0xea ê
130 0x82
205 0xcd Í

Don't be fooled by the fact that all of the characters happen to be in
the range 0-255; that is irrelevant.

> Does it
> assume the string represents a utf-16 encoding, at utf-8 encoding,
> etc...?

None of the above. It assumes nothing. It takes a string of characters,
end of story.

> For reference the string is これは in the 'shift-jis' encoding.

No, it is not. The way to get a unicode literal with those characters is
to use a unicode-aware editor or terminal:

>>> s = u'これは'
>>> for c in s:
...     print ord(c), hex(ord(c)), c
...
12371 0x3053 こ
12428 0x308c れ
12399 0x306f は

You are confusing characters with bytes. I believe that what you are
thinking of is the following: you start with a byte string, and then
decode it into unicode:

>>> bytes = '\x82\xb1\x82\xea\x82\xcd'  # not u'...'
>>> text = bytes.decode('shift-jis')
>>> print text
これは

If you get the encoding wrong, you will get the wrong characters:

>>> print bytes.decode('utf-16')
놂춂

If you start with the Unicode characters, you can encode them into
various byte strings:

>>> s = u'これは'
>>> s.encode('shift-jis')
'\x82\xb1\x82\xea\x82\xcd'
>>> s.encode('utf-8')
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
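
For readers on Python 3, the same distinction between bytes and characters can be sketched as below. This is only an illustration under stated assumptions: it assumes Python 3 (where b'...' is a byte string and '...' is text, unlike the Python 2 sessions above), and the variable names are made up for the example. It also shows one possible way to recover the intended text if the six code points were created by mistake: encoding with Latin-1 maps each code point in the range 0-255 back to the byte of the same value.

```python
# Python 3 sketch (assumption: the original post targets Python 2).
# All variable names here are hypothetical, chosen for illustration.

raw = b'\x82\xb1\x82\xea\x82\xcd'        # six bytes, no characters yet
text = raw.decode('shift_jis')           # three characters
print(len(raw), len(text))

# The same escapes in a text literal give six characters
# (U+0082, U+00B1, ...), not Japanese text.
mistaken = '\x82\xb1\x82\xea\x82\xcd'
print(len(mistaken), [hex(ord(c)) for c in mistaken])

# Latin-1 maps code points 0-255 one-to-one onto bytes 0-255,
# so it can turn the mistaken text back into the original bytes.
recovered = mistaken.encode('latin-1').decode('shift_jis')
print(recovered == text)

# Encoding the real text gives different bytes per encoding.
as_sjis = text.encode('shift_jis')
as_utf8 = text.encode('utf-8')
```

The Latin-1 round trip only works when every character is in the range 0-255; it is a repair for this specific mistake, not a general way to guess an encoding.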