Martin v. Löwis <[EMAIL PROTECTED]> added the comment:

> As for treating Latin-1 as a raw encoding, how can that be
> theoretically okay if the parser assumes UTF-8 and Latin-1 is not a
> subset of UTF-8?

The parser doesn't assume UTF-8, but "ascii+", i.e. it passes all
non-ASCII bytes on to the AST, which then needs to deal with them. The
AST could then (but apparently doesn't) take into account whether the
internal representation was UTF-8 or Latin-1: see ast.c:decode_unicode
for some remains of that.
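To illustrate the observable effect (a minimal sketch of mine, not
code from the tracker): a non-ASCII byte in a string literal is handed
through to the AST and decoded according to the declared source
encoding, so in a Python 3 where Latin-1 sources are handled correctly,
both of the following programs bind s to 'é':

    # 0xE9 is 'é' in Latin-1; the pair 0xC3 0xA9 is 'é' in UTF-8.
    latin1_src = b"# -*- coding: latin-1 -*-\ns = '\xe9'\n"
    utf8_src = b"# -*- coding: utf-8 -*-\ns = '\xc3\xa9'\n"

    for src in (latin1_src, utf8_src):
        ns = {}
        exec(compile(src, "<test>", "exec"), ns)
        print(ns["s"])  # 'é' in both cases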
The other case (besides string literals) where bytes > 127 matter is
tokenizer.c:verify_identifier; this indeed assumes UTF-8 only, but
could easily be extended to support Latin-1 as well (see the sketch at
the end of this comment). The third case where non-ASCII bytes are
allowed is comments; there they are entirely ignored (i.e. it is not
even verified that the comment is well-formed UTF-8).

Removal of the special case should simplify the code; I would agree
that any speedup gained by not going through a codec is irrelevant. I'm
still puzzled why test_imp fails if the special case is removed.
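For reference, a minimal sketch (mine, not the C code in tokenizer.c)
of the kind of check verify_identifier performs, and of what extending
it to Latin-1 would amount to: decoding with latin-1 rather than utf-8,
which never fails at the byte level:

    def verify_identifier(raw: bytes, encoding: str = "utf-8") -> bool:
        # Reject identifiers whose raw bytes are not well-formed in the
        # assumed source encoding; then apply Python's identifier rules.
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            return False
        return text.isidentifier()

    print(verify_identifier(b"caf\xc3\xa9"))         # True: valid UTF-8 for 'café'
    print(verify_identifier(b"caf\xe9"))             # False: 0xE9 alone is not UTF-8
    print(verify_identifier(b"caf\xe9", "latin-1"))  # True once Latin-1 is accepted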