On Wed, Jun 1, 2011 at 2:31 AM, Prasad, Ramit <ramit.pra...@jpmchase.com> wrote:
>> line = unicode(line.strip(),'utf8')
>> and now i get really utf8-strings. It does work but i dont know why it works.
>> For me it looks like i change an utf8-string to an utf8-string.
>
> I would like to point out that UTF-8 is not exactly "Unicode". From what I
> understand, Unicode is a standard while UTF-8 is like an implementation of
> that standard (called an encoding). Being able to convert to Unicode (the
> standard) should mean you are then able to convert to any encoding that
> supports the Unicode characters used.

Unicode defines characters; UTF-8 is one way (of many) to represent
those characters in bytes. UTF-16 and UTF-32 are other ways of
representing those characters in bytes, and internally, Python
probably uses one of them - but there is no guarantee, and you should
never need to know.

Unicode strings can be stored in memory and manipulated in various
ways, but they're a high-level construct on par with lists and
dictionaries - they can't be stored on disk or transmitted to another
computer without using an encoding system. UTF-8 is an efficient way
to translate Unicode text consisting primarily of low-codepoint
characters into bytes. It's not so much an implementation of Unicode
as a means of converting a mythical concept of "Unicode characters"
into a concrete stream of bytes.

Hope that clarifies things a little!

Chris Angelico
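
P.S. A minimal sketch of the distinction, in case it helps. It assumes
Python 3, where the bytes/text split is explicit and
raw.decode("utf-8") plays the role that unicode(raw, 'utf8') plays in
Python 2. The sample word and the variable names are made up for
illustration.

```python
# Bytes as they might arrive from a file or socket: "café" encoded
# as UTF-8, plus a trailing newline. (Sample data, not from the thread.)
raw = b"caf\xc3\xa9\n"

# Decoding turns a stream of bytes into a Unicode string.
text = raw.strip().decode("utf-8")

# The same Unicode string can be written out in several encodings.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")
as_utf32 = text.encode("utf-32-le")

# One count of characters, three different counts of bytes.
print(len(text), len(as_utf8), len(as_utf16), len(as_utf32))
# prints: 4 5 8 16
```

So the original line is not turning a UTF-8 string into a UTF-8
string: it takes UTF-8 bytes and produces Unicode text, which can then
be encoded again in whatever form is needed.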