Matt Giuca <[EMAIL PROTECTED]> added the comment:

> 3.0b1 has been released, so no new features can be added to 3.0.

While my proposal is no doubt going to cause a lot of code breakage, I hardly consider it a "new feature". This is very definitely a bug. As I understand it, the point of a code freeze is to stop the addition of features which could be added to a later version. Realistically, there is no way this issue can be fixed after 3.0 is released, as it necessarily involves changing the behaviour of this function.

Perhaps I should explain further why this is a regression from Python 2.x and not a feature request. In Python 2.x, with byte strings, the encoding is not an issue. quote and unquote simply encode and decode bytes, and if you want to use Unicode you have complete control. In Python 3.0, with Unicode strings, if functions manipulate string objects, you don't have control over the encoding unless the functions give you explicit control. So Python 3.0's native Unicode strings have broken the library.

I give two examples.

Firstly, I believe that unquote(quote(x)) == x should always be true for all strings x. In Python 2.x, this is always trivially true (for non-Unicode strings), because the functions simply encode and decode the octets. In Python 3.0, the two functions are inconsistent, and the round trip breaks for characters outside range(0, 256).

>>> urllib.parse.unquote(urllib.parse.quote('ÿ')) # '\u00ff'
'ÿ'
# Works, because both functions work with ISO-8859-1 in this range.

>>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
'Ä\x80'
# Fails, because quote uses UTF-8 and unquote uses ISO-8859-1.

My patch succeeds for all characters.

>>> urllib.parse.unquote(urllib.parse.quote('Ā')) # '\u0100'
'Ā'

Secondly, a bigger example, because I want to demonstrate how this bug affects web applications, even very simple ones.

Consider this simple (beginnings of a) wiki system in Python 2.5, as a CGI app:

#---
import cgi

fields = cgi.FieldStorage()
title = fields.getfirst('title')

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s</p>' % repr(title))
if title is None:
    print("No article selected")
else:
    print('<p>Information about %s.</p>' % cgi.escape(title))
#---

(Place this in cgi-bin, navigate to it, and add the query string "?title=Page Title"). I'll use the page titled "Mátt" as a test case.

If you navigate to "?title=Mátt", it displays the text "Debug: 'M\xc3\xa1tt'. Information about Mátt.". The browser (at least Firefox, Safari and IE, which I have tested) encodes this as "?title=M%C3%A1tt". So this is trivial, as it's just being unquoted into a raw byte string 'M\xc3\xa1tt', then written out again as a byte string.

Now consider that you want to manipulate it as a Unicode string, still in Python 2.5. You could augment the program to decode it as UTF-8 and then re-encode it. (I wrote a simple UTF-8 printing function which takes Unicode strings as input).

#---
import sys
import cgi

def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function)."""
    sys.stdout.write(' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.write('\n')

fields = cgi.FieldStorage()
title = fields.getfirst('title')
if title is not None:
    title = str(title).decode("utf-8", "replace")

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s.</p>' % repr(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
#---

Now given the same input ("?title=Mátt"), it displays "Debug: u'M\xe1tt'. Information about Mátt." Still working fine, and I can manipulate it as Unicode because in Python 2.x I have direct control over encoding/decoding.

Now let us upgrade this program to Python 3.0.
(Note that I still can't print Unicode characters out directly: when running through Apache the stdout encoding is not UTF-8, so I use my printu8 function.)

#---
import sys
import cgi

def printu8(*args):
    """Prints to stdout encoding as utf-8, rather than the current terminal
    encoding. (Not a fully-featured print function)."""
    sys.stdout.buffer.write(b' '.join([x.encode('utf-8') for x in args]))
    sys.stdout.buffer.write(b'\n')

fields = cgi.FieldStorage()
title = fields.getfirst('title')
# Note: No call to decode. I have no opportunity to specify the encoding since
# it comes straight out of FieldStorage as a Unicode string.

print("Content-Type: text/html; charset=utf-8")
print("")

print('<p>Debug: %s.</p>' % ascii(title))
if title is None:
    print("No article selected.")
else:
    printu8('<p>Information about %s.</p>' % cgi.escape(title))
#---

Now given the same input ("?title=Mátt"), it displays "Debug: 'M\xc3\xa1tt'. Information about MÃ¡tt." Once again, the input is erroneously (and implicitly) decoded as ISO-8859-1, so I end up with a meaningless Unicode string. The only possible thing I can do about this as a web developer is call title.encode('latin-1').decode('utf-8'), which is a dreadful hack. With my patch applied, the input ("?title=Mátt") produces the output "Debug: 'M\xe1tt'. Information about Mátt."

Basically, this bug is going to affect all web developers as soon as someone types a non-ASCII character. You could argue that supporting UTF-8 by default is no better than supporting Latin-1 by default, but it is. UTF-8 can encode all characters where Latin-1 cannot, UTF-8 is the recommended URI encoding in both the URI Syntax RFC[1] and the W3C HTML 4.01 specification[2], and all major browsers use it to encode non-ASCII characters in URIs.

My patch may not be the best, or most conservative, solution to this problem. I'm happy to see other proposals.
But it's clearly an important bug to fix, if I can't even write the simplest web app I can think of without having to use a kludgey hack to get the string decoded correctly. What is the point of having nice clean Unicode strings in the language if the library spits out the wrong characters and it requires more work to fix them than it used to with byte strings?

[1] http://tools.ietf.org/html/rfc3986#section-2.5
[2] http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________
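
A minimal sketch of the decoding mismatch described above, using plain bytes rather than the actual urllib or cgi code paths. The variable names (raw, wrong, right, repaired) are illustrative only and do not come from the patch; the bytes are the ones the browser sends for "?title=Mátt" once the percent-escapes are removed.

```python
# Octets a browser sends for "?title=Mátt" after percent-decoding M%C3%A1tt.
# (Assumption: the browser encodes the query string as UTF-8.)
raw = b'M\xc3\xa1tt'

# Implicit ISO-8859-1 decoding: every byte becomes one character,
# so the two-byte UTF-8 sequence turns into two unrelated characters.
wrong = raw.decode('latin-1')

# Decoding with the encoding the browser actually used.
right = raw.decode('utf-8')

# The workaround mentioned in the text: undo the Latin-1 decoding to get
# the original bytes back, then decode those bytes as UTF-8.
repaired = wrong.encode('latin-1').decode('utf-8')

print(ascii(wrong))       # 'M\xc3\xa1tt'
print(ascii(right))       # 'M\xe1tt'
print(repaired == right)  # True
```

The workaround only succeeds because Latin-1 maps every byte to exactly one character, so the mis-decoding can be reversed without loss; it still depends on the caller knowing which encoding the library guessed.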