On Sun, Jan 31, 2010 at 5:12 PM, Tracubik <affdfsdfds...@b.com> wrote:
> On Sun, 31 Jan 2010 13:46:16 +0100, Günther Dietrich wrote:
>
>> Maybe you might solve this if you decode your string to unicode.
>> Example:
>>
>> |>>> euro = "€"
>> |>>> len(euro)
>> |3
>> |>>> u_euro = euro.decode('utf_8')
>> |>>> len(u_euro)
>> |1
>>
>> Adapt the encoding ('utf_8' in my example) to whatever you use.
>>
>> Or create the unicode string directly:
>>
>> |>>> u_euro = u'€'
>> |>>> len(u_euro)
>> |1
>>
>> Best regards,
>>
>> Günther
>
> thank you, your two solution is really interesting.
> is there a possible to set unicode encoding by default for my python
> scripts?
> i've tried inserting
> # -*- coding: utf-8 -*-
>
> at the beginning of my script but doesn't solve the problem

First of all, if you haven't read this before, please do. It will make
all of this much clearer.
http://www.joelonsoftware.com/articles/Unicode.html

To reiterate: UTF-8 IS NOT UNICODE!!!!

In Python 2, '*' signifies a byte string. It is read as a sequence of
bytes and interpreted as a sequence of bytes. When Python encounters the
sequence 0x27 0xe2 0x82 0xac 0x27 in the code (the UTF-8 bytes for '€'),
it interprets it as 3 bytes between the two quotes. It doesn't care
about characters or anything like that.

u'*' signifies a Unicode string. Python will attempt to convert the
sequence of bytes into a sequence of characters. It can use any encoding
for that: cp1252, utf-8, MacRoman, ISO-8859-15. UTF-8 isn't special;
it's just one of the few encodings capable of storing all of the
possible Unicode characters.

What the line at the top says is that the file should be read using
UTF-8. Byte strings are still just sequences of bytes; this doesn't
affect them. But any Unicode string literal will be decoded using UTF-8.
If Python looks at the above sequence of bytes as a Unicode string, it
views the 3 bytes as a single character. When you ask for its length,
it returns the number of characters.

Solution to your problem: in addition to keeping the #-*- coding ...
line, go with Günther's advice and use Unicode strings.

> --
> http://mail.python.org/mailman/listinfo/python-list
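
To make the byte-string versus Unicode-string distinction concrete, here is a minimal sketch. It is not from the original thread: the variable names are illustrative, and I have used explicit `b'...'` and `u'...'` prefixes with escape sequences so the result does not depend on how the file is saved. As written it should run under both Python 2.6+ and Python 3.3+.

```python
# -*- coding: utf-8 -*-
# Illustrative names only; none of these come from the original thread.

# The euro sign as raw bytes: its UTF-8 encoding is 0xe2 0x82 0xac.
euro_bytes = b'\xe2\x82\xac'

# Decoding interprets those bytes as text, giving a one-character string.
euro_text = euro_bytes.decode('utf-8')

# A Unicode literal gives the same string directly (U+20AC is the euro sign).
euro_literal = u'\u20ac'

print(len(euro_bytes))             # 3 (bytes)
print(len(euro_text))              # 1 (character)
print(euro_text == euro_literal)   # True

# The same character becomes different bytes under other encodings,
# which is why UTF-8 and Unicode are not the same thing.
print(len(euro_text.encode('cp1252')))     # 1 (cp1252 stores it as 0x80)
print(len(euro_text.encode('utf-16-le')))  # 2
```

In a Python 2 script saved as UTF-8 with the coding line in place, writing u'€' should give the same one-character string as `euro_literal` here.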