bob gailer wrote:
 >>> "™"
'\xe2\x84\xa2'

What is this hex string?

Presumably you are running Python 2, yes? I will assume so in the following explanation.

You have just run smack bang into a collision between text and bytes, and in particular, the fact that your console is probably Unicode-aware, but Python's so-called strings are by default bytes and not text.

When you enter "™", your console is more than happy to allow you to enter a Unicode trademark character[1] and put it in between " " delimiters. This creates a plain byte string. But the ™ character is not a byte, and shouldn't be treated as one. Arguably Python should raise an error, but it doesn't: your console has already encoded the character to bytes using its own encoding (probably UTF-8), and Python 2 simply stores those bytes in the string. The three hex bytes you actually get are the encoding of the TM character.
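If you would rather check than guess which encoding is in play, you can ask Python what it thinks your console is using. This is only a rough sketch: guess_console_encoding is a name I made up for illustration, and the fallback to the locale's preferred encoding is my assumption about what is reasonable when stdin reports nothing.

```python
import locale
import sys


def guess_console_encoding():
    """Return the encoding Python believes the console is using.

    Hypothetical helper for illustration, not a standard library function.
    """
    # sys.stdin.encoding can be None, for example when input is piped.
    encoding = getattr(sys.stdin, 'encoding', None)
    # Assumption: fall back to the locale's preferred encoding in that case.
    return encoding or locale.getpreferredencoding(False)


print(guess_console_encoding())
```

What it prints depends entirely on your system, so treat the result as a hint rather than proof.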

Python 2 does have proper text strings, but you have to write them as Unicode strings, with a u prefix:

py> s = u"™"
py> len(s)
1
py> s
u'\u2122'
py> print s
™
py> s.encode('utf-8')
'\xe2\x84\xa2'

Notice that encoding the trademark character to UTF-8 gives the same sequence of bytes as you got from the plain "™" string, which supports my guess that UTF-8 is the encoding in use.
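To see why the matching bytes are decent evidence for UTF-8, here is a small sketch comparing a few other encodings. The choice of encodings is mine, picked only for illustration, and the variable names are made up. It should run under Python 2.6+ and Python 3.3+.

```python
trademark = u'\u2122'  # the trademark sign

# Each encoding turns the same character into different bytes.
utf8_bytes = trademark.encode('utf-8')
utf16_bytes = trademark.encode('utf-16-le')
cp1252_bytes = trademark.encode('cp1252')

assert utf8_bytes == b'\xe2\x84\xa2'   # the three bytes from the question
assert utf16_bytes == b'\x22\x21'      # two bytes, little-endian
assert cp1252_bytes == b'\x99'         # a single byte

# Some encodings cannot represent the character at all.
try:
    trademark.encode('latin-1')
except UnicodeEncodeError:
    latin1_can_encode = False
else:
    latin1_can_encode = True

assert latin1_can_encode is False
```

Only UTF-8 gives the three bytes shown in the question.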

If you take the three-byte string and decode it using UTF-8, you will get the trademark character back.
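A minimal sketch of that round trip, which should run under Python 2.6+ and Python 3.3+. The variable names are mine, and the Windows-1252 part at the end is an extra illustration I have added to show what a wrong guess at the encoding looks like.

```python
raw = b'\xe2\x84\xa2'        # the three bytes from the question

text = raw.decode('utf-8')   # bytes -> text
assert len(raw) == 3         # three bytes...
assert len(text) == 1        # ...but only one character
assert text == u'\u2122'     # the trademark sign

# Encoding again gives back the original bytes.
assert text.encode('utf-8') == raw

# Decoding with the wrong encoding does not fail here, it silently
# produces three unrelated characters instead.
wrong = raw.decode('cp1252')
assert wrong == u'\xe2\u201e\xa2'
assert wrong != text
```

Decoding only recovers the original text if you use the same encoding that produced the bytes in the first place.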

If all my talk of encodings doesn't mean anything to you, you should read this:

http://www.joelonsoftware.com/articles/Unicode.html




[1] Assuming your console is set to use the same encoding as my mail client is using. Otherwise I'm seeing something different to you.

--
Steven
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor