On Wed, Jun 1, 2011 at 2:31 AM, Prasad, Ramit <ramit.pra...@jpmchase.com> wrote:
>>line = unicode(line.strip(),'utf8')
>>and now I get real utf8-strings. It does work, but I don't know why it works.
>>To me it looks like I change a utf8-string to a utf8-string.
>
>
> I would like to point out that UTF-8 is not exactly "Unicode". From what I 
> understand, Unicode is a standard while UTF-8 is like an implementation of 
> that standard (called an encoding). Being able to convert to Unicode (the 
> standard) should mean you are then able to convert to any encoding that 
> supports the Unicode characters used.

Unicode defines characters; UTF-8 is one way (of many) to represent
those characters in bytes. UTF-16 and UTF-32 are other ways of
representing those characters in bytes, and internally, Python
probably uses one of them - but there is no guarantee, and you should
never need to know. Unicode strings can be stored in memory and
manipulated in various ways, but they're a high-level construct on par
with lists and dictionaries - they can't be stored on disk or
transmitted to another computer without first being encoded as bytes.
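
To make that concrete, here is a rough sketch of the round trip. It
assumes Python 3 syntax, where str is the Unicode type and bytes is the
encoded form (in Python 2 the corresponding types are unicode and str,
and the decode step is what unicode(line.strip(), 'utf8') does). The
sample data and variable names are made up for illustration:

```python
# Python 3: str holds Unicode characters, bytes holds encoded data.
# (Hypothetical sample data; not taken from the original poster's file.)
raw = b"caf\xc3\xa9\n"   # UTF-8 encoded bytes, as they might arrive from a file

# Decoding turns UTF-8 bytes into a Unicode string. This is the Python 3
# counterpart of the Python 2 call unicode(line.strip(), 'utf8').
line = raw.strip().decode("utf8")

# Encoding goes the other way, and each encoding produces different bytes.
as_utf8 = line.encode("utf-8")
as_utf16 = line.encode("utf-16-le")
as_utf32 = line.encode("utf-32-le")

# Decoding with the matching encoding recovers the same characters.
assert as_utf16.decode("utf-16-le") == line
assert as_utf32.decode("utf-32-le") == line

# Character count, then byte counts in UTF-8, UTF-16 and UTF-32.
print(len(line), len(as_utf8), len(as_utf16), len(as_utf32))
```

So the original line is not converting a UTF-8 string into a UTF-8
string: it converts UTF-8 bytes into a Unicode string, which can then be
encoded again in whatever encoding is needed.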

UTF-8 is an efficient way to translate Unicode text consisting
primarily of low-codepoint characters (ASCII characters take one byte
each) into bytes. It's not so much an implementation of Unicode as a
means of converting the abstract concept of "Unicode characters" into a
concrete stream of bytes.
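
A small sketch of why low codepoints are cheap in UTF-8, again assuming
Python 3; the four sample characters are arbitrary picks, one from each
UTF-8 length class:

```python
# One sample character per UTF-8 length class (arbitrary choices).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character needs once encoded as UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, utf8_lengths):
    # Show the code point in hex next to its encoded size.
    print("U+%04X -> %d byte(s)" % (ord(ch), n))
```

The higher the codepoint, the more bytes UTF-8 spends on it, from one
byte for ASCII up to four for characters outside the Basic Multilingual
Plane.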

Hope that clarifies things a little!

Chris Angelico
-- 
http://mail.python.org/mailman/listinfo/python-list
