On Oct 28, 3:06 am, Steven D'Aprano <steve +comp.lang.pyt...@pearwood.info> wrote:
> On Thu, 27 Oct 2011 20:05:13 -0700, Fletcher Johnson wrote:
> > If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd' how does
> > this creation process interpret the bytes in the byte string?
>
> It doesn't, because there is no byte-string. You have created a Unicode
> object from a literal string of unicode characters, not bytes. Those
> characters are:
>
> Dec Hex  Char
> 130 0x82 ‚
> 177 0xb1 ±
> 130 0x82 ‚
> 234 0xea ê
> 130 0x82 ‚
> 205 0xcd Í
>
> Don't be fooled that all of the characters happen to be in the range
> 0-255, that is irrelevant.
>
> > Does it
> > assume the string represents a utf-16 encoding, at utf-8 encoding,
> > etc...?
>
> None of the above. It assumes nothing. It takes a string of characters,
> end of story.
>
> > For reference the string is これは in the 'shift-jis' encoding.
>
> No it is not. The way to get a unicode literal with those characters is
> to use a unicode-aware editor or terminal:
>
> >>> s = u'これは'
> >>> for c in s:
> ...     print ord(c), hex(ord(c)), c
> ...
> 12371 0x3053 こ
> 12428 0x308c れ
> 12399 0x306f は
>
> You are confusing characters with bytes. I believe that what you are
> thinking of is the following: you start with a byte string, and then
> decode it into unicode:
>
> >>> bytes = '\x82\xb1\x82\xea\x82\xcd'  # not u'...'
> >>> text = bytes.decode('shift-jis')
> >>> print text
> これは
>
> If you get the encoding wrong, you will get the wrong characters:
>
> >>> print bytes.decode('utf-16')
> 놂춂
>
> If you start with the Unicode characters, you can encode it into various
> byte strings:
>
> >>> s = u'これは'
> >>> s.encode('shift-jis')
> '\x82\xb1\x82\xea\x82\xcd'
> >>> s.encode('utf-8')
> '\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'
>
> --
> Steven

Thanks Steven. You are right. I was confusing characters with bytes.
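
For anyone reading this thread on Python 3, the same distinction can be sketched roughly as below. This is only an illustration and not part of Steven's post. It assumes Python 3, where a byte string is written b'...' and a plain '...' literal is already Unicode text. The names raw, literal, code_points, text, and wrong are made up for the example, and 'utf-16-le' is used instead of 'utf-16' so the result does not depend on the machine's byte order.

```python
# Assumption: Python 3. b'...' is a byte string; '...' is Unicode text.

# Six bytes: the Shift-JIS encoding of three Japanese characters.
raw = b'\x82\xb1\x82\xea\x82\xcd'

# A text literal with the same escapes is NOT bytes. It is six
# characters whose code points happen to fall in the range 0-255.
literal = '\x82\xb1\x82\xea\x82\xcd'
code_points = [ord(c) for c in literal]

# Decoding turns bytes into text; the codec must be named explicitly.
text = raw.decode('shift_jis')

# Decoding the same bytes with the wrong codec gives different characters.
wrong = raw.decode('utf-16-le')

# Encoding turns text back into bytes, and each codec gives different bytes.
sjis_bytes = text.encode('shift_jis')
utf8_bytes = text.encode('utf-8')

if __name__ == '__main__':
    print(len(raw), len(literal), len(text))  # 6 6 3
    print([hex(n) for n in code_points])
    print(text)                               # これは
    print(sjis_bytes == raw)                  # True
    print(utf8_bytes)
```

The point is the same as in the quoted reply: nothing is guessed about encodings when a text literal is created. An encoding only comes into play when decode() or encode() is called.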