Re: Unicode literals and byte string interpretation.

Fletcher Johnson Mon, 31 Oct 2011 21:22:26 -0700

On Oct 28, 3:06 am, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> On Thu, 27 Oct 2011 20:05:13 -0700, Fletcher Johnson wrote:
> > If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd' how does
> > this creation process interpret the bytes in the byte string?
>
> It doesn't, because there is no byte-string. You have created a Unicode
> object from a literal string of Unicode characters, not bytes. Those
> characters are:
>
> Dec Hex  Char
> 130 0x82 ‚
> 177 0xb1 ±
> 130 0x82 ‚
> 234 0xea ê
> 130 0x82 ‚
> 205 0xcd Í
>
> Don't be fooled by the fact that all of the characters happen to be in
> the range 0-255; that is irrelevant.
>
> > Does it
> > assume the string represents a utf-16 encoding, a utf-8 encoding,
> > etc...?
>
> None of the above. It assumes nothing. It takes a string of characters,
> end of story.
>
> > For reference the string is これは in the 'shift-jis' encoding.
>
> No, it is not. The way to get a Unicode literal with those characters is
> to use a Unicode-aware editor or terminal:
>
> >>> s = u'これは'
> >>> for c in s:
> ...     print ord(c), hex(ord(c)), c
> ...
> 12371 0x3053 こ
> 12428 0x308c れ
> 12399 0x306f は
>
> You are confusing characters with bytes. I believe that what you are
> thinking of is the following: you start with a byte string, and then
> decode it intounicode:
>
> >>> bytes = '\x82\xb1\x82\xea\x82\xcd'  # not u'...'
> >>> text = bytes.decode('shift-jis')
> >>> print text
> これは
>
> If you get the encoding wrong, you will get the wrong characters:
>
> >>> print bytes.decode('utf-16')
> 놂춂
>
> If you start with the Unicode characters, you can encode them into
> various byte strings:
>
> >>> s = u'これは'
> >>> s.encode('shift-jis')
> '\x82\xb1\x82\xea\x82\xcd'
> >>> s.encode('utf-8')
> '\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'
>
> --
> Steven


Thanks Steven. You are right. I was confusing characters with bytes.
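
For anyone who finds this thread later, here is a minimal sketch of the
distinction as I now understand it. It is written in Python 3 syntax (the
session quoted above is Python 2), where text literals are Unicode by
default and byte strings take a b'...' prefix. The names raw, literal,
text and recovered are just my own illustration, and the latin-1 round
trip at the end is one possible workaround that I am assuming is
acceptable here, not something suggested in the thread.

```python
# Python 3 sketch: str holds code points, bytes holds raw bytes.
raw = b'\x82\xb1\x82\xea\x82\xcd'      # six bytes, no characters yet
literal = '\x82\xb1\x82\xea\x82\xcd'   # six code points: U+0082, U+00B1, ...

# Escapes in a text literal name code points directly; no codec is involved.
assert [ord(c) for c in literal] == [0x82, 0xb1, 0x82, 0xea, 0x82, 0xcd]

# Decoding is the step that interprets bytes, and the codec must be named.
text = raw.decode('shift_jis')
assert text == '\u3053\u308c\u306f'    # the three characters これは

# Encoding goes the other way: the same text yields different bytes per codec.
assert text.encode('shift_jis') == raw
assert text.encode('utf-8') == b'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'

# Assumed workaround: latin-1 maps code points 0-255 to identical byte
# values, so it can turn the mistaken literal back into the original bytes.
recovered = literal.encode('latin-1').decode('shift_jis')
assert recovered == text
```

The last step only works because every character in the mistaken literal
is below 256; it would raise UnicodeEncodeError otherwise.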
-- 
http://mail.python.org/mailman/listinfo/python-list
