On Mar 20, 2:20 pm, Laszlo Nagy <[EMAIL PROTECTED]> wrote:
>
> >> >>> eval( u'"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' ) == eval( '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
> >> True
>
> > When you feed your unicode data into eval(), it doesn't have any
> > encoding or decoding work to do.
>
> Yes, but what about
>
> eval( 'u' + '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
Let's take it apart, bit by bit:

  'u' - A byte string with one byte, which is 117.

  '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' - A byte string starting with
  " (34), but then continuing with an unspecified byte sequence. I don't
  know what encoding your terminal/file/whatnot uses. Assuming it is
  UTF-8 and not UTF-16, it is the UTF-8 representation of the unicode
  code points that follow.

Before you pass them to eval, you concatenate them. So now you have a
byte string that starts with u, then ", then bytes with values above
127. When you call eval, you pass in that byte string. This byte
string, it is important to emphasize, is not text. It is text encoded
in some format.

Here is what my interpreter does (in a UTF-8 console):

>>> u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"
u'\u5fb9\u5e95\u3057\u305f\u30b3\u30b9\u30c8\u524a\u6e1b \xc1\xcd\u0170\u0150\xdc\xd6\xda\xd3\xc9 \u0442\u0440\u0438\u0440\u043e\u0432\u0430'

The first item in the sequence is \u5fb9 -- a unicode code point. It is
NOT a byte.

>>> eval( '"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
'\xe5\xbe\xb9\xe5\xba\x95\xe3\x81\x97\xe3\x81\x9f\xe3\x82\xb3\xe3\x82\xb9\xe3\x83\x88\xe5\x89\x8a\xe6\xb8\x9b \xc3\x81\xc3\x8d\xc5\xb0\xc5\x90\xc3\x9c\xc3\x96\xc3\x9a\xc3\x93\xc3\x89 \xd1\x82\xd1\x80\xd0\xb8\xd1\x80\xd0\xbe\xd0\xb2\xd0\xb0'

The first item in the sequence is \xe5. This IS a byte. It is NOT a
unicode code point. It doesn't represent anything except what you want
it to represent.

>>> eval( 'u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
u'\xe5\xbe\xb9\xe5\xba\x95\xe3\x81\x97\xe3\x81\x9f\xe3\x82\xb3\xe3\x82\xb9\xe3\x83\x88\xe5\x89\x8a\xe6\xb8\x9b \xc3\x81\xc3\x8d\xc5\xb0\xc5\x90\xc3\x9c\xc3\x96\xc3\x9a\xc3\x93\xc3\x89 \xd1\x82\xd1\x80\xd0\xb8\xd1\x80\xd0\xbe\xd0\xb2\xd0\xb0'

The first item in the sequence is \xe5. This is NOT a byte. It is the
unicode code point U+00E5, LATIN SMALL LETTER A WITH RING ABOVE.

>>> eval( u'u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова"' )
u'\u5fb9\u5e95\u3057\u305f\u30b3\u30b9\u30c8\u524a\u6e1b \xc1\xcd\u0170\u0150\xdc\xd6\xda\xd3\xc9 \u0442\u0440\u0438\u0440\u043e\u0432\u0430'

The first item in the sequence is \u5fb9, which is a unicode code point.

In a Python source file proper, if you have your encoding set up
properly, the expression u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова" is a
perfectly valid expression. What happens is that the Python interpreter
reads in the string of bytes between the quotes, decodes them to
unicode based on the encoding you already declared, and creates a
unicode object to represent that. eval doesn't muck with encodings.

I'll try to address your points below in the context of what I just
wrote.

> The passed expression is not unicode. It is a "normal" string. A
> sequence of bytes.

Yes.

> It will be evaluated by eval, and eval should know how to decode the
> byte sequence.

You think eval is smarter than it is.

> Same way as the interpreter need to know the encoding of the file
> when it sees the u"徹底したコスト削減 ÁÍŰŐÜÖÚÓÉ трирова" byte sequence
> in a python source file - before creating the unicode instance, it
> needs to be decoded (or not, depending on the encoding of the source).

Precisely. And it is. Before it is passed to eval/exec/whatever.

> String passed to eval IS python source, and it SHOULD have an encoding
> specified (well, unless it is already a unicode string, in that case
> this magic is not needed).

If it had an encoding specified, YOU should have decoded it and passed
in the unicode string.
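In other words, something like this minimal sketch (Python 2; the
variable names are made up, and it assumes the bytes are UTF-8, to
match the UTF-8 console in the session above):

    raw = 'u' + '"\xc3\x81\xc3\x8d"'     # byte string: u"ÁÍ" as UTF-8 bytes

    # Feeding the bytes straight to eval: eval never decodes them, so
    # each byte ends up as its own code point, just like in the session
    # above.
    mojibake = eval(raw)                 # u'\xc3\x81\xc3\x8d'

    # Decode first, then eval the resulting unicode source.
    correct = eval(raw.decode('utf-8'))  # u'\xc1\xcd'

    print repr(mojibake)
    print repr(correct)

The decode call is the piece of "magic" that eval will not do for you.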
> Consider this:
>
> exec("""
> import codecs
> s = u'Ű'
> codecs.open("test.txt","w+",encoding="UTF8").write(s)
> """)
>
> Facts:
>
> - source passed to exec is a normal string, not unicode
> - the variable "s", created inside the exec() call will be a unicode
>   string. However, it may be Û or something else, depending on the
>   source encoding. E.g. ASCII encoding it is invalid and exec() should
>   raise a SyntaxError like:
>
> SyntaxError: Non-ASCII character '\xc5' in file c:\temp\aaa\test.py on
> line 1, but no encoding declared; see
> http://www.python.org/peps/pep-0263.html for details
>
> Well at least this is what I think. If I'm not right then please
> explain why.

If you want to know what happens, you have to try it. Here's what
happens (again, in my UTF-8 terminal):

>>> exec("""
... import codecs
... s = u'Ű'
... codecs.open("test.txt","w+",encoding="UTF8").write(s)
... """)
>>> s
u'\xc5\xb0'
>>> print s
Å°
>>> file('test.txt').read()
'\xc3\x85\xc2\xb0'
>>> print file('test.txt').read()
Å°

Note that s is a unicode string with 2 unicode code points. Note that
the file contains 4 bytes, since it is that 2-code-point sequence
encoded in UTF-8, and neither code point is ASCII.

Your problem is, I think, that you think the magic of decoding source
code from the byte sequence into unicode happens in exec or eval. It
doesn't. It happens in between reading the file and passing the
contents of the file to exec or eval.
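If you want that in-between step spelled out, here is a minimal sketch
(Python 2). The file name snippet.py and its UTF-8 encoding are
assumptions for illustration, and it ignores PEP 263 coding cookies
entirely, which a real loader would have to honour:

    # Hypothetical file and encoding; the point is only who decodes.
    source_bytes = open('snippet.py', 'rb').read()  # raw bytes from disk
    source_text = source_bytes.decode('utf-8')      # the step exec won't do for you

    namespace = {}
    exec source_text in namespace   # exec receives already-decoded unicode source
    print repr(namespace.get('s'))  # assumes the snippet defines s, as above

With the body of your exec example saved in snippet.py as UTF-8, this
prints u'\u0170', a single code point, instead of the two-code-point
surprise above.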