Matt Giuca <[EMAIL PROTECTED]> added the comment:

So today I grepped for "urllib" in the entire library in an effort to track down every dependency on quote and unquote to see exactly how my patch breaks other code. I've now investigated every module in the library which uses quote, unquote or urlencode, and my findings are documented below in detail.
So far I have found no code "breakage" except for the original email.utils issue I fixed in patch 2. Of course, that doesn't mean the behaviour hasn't changed. Nearly all modules in the report below have changed their behaviour: they used to deal with Latin-1-encoded URLs and now deal with UTF-8-encoded URLs. As discussed at length above, I see this as a positive change, since nearly everybody encodes URLs in UTF-8, and of course it allows for all characters.

I also point out that the http.server module (unpatched) is internally broken when dealing with filenames with characters outside range(0,256); my patch fixes it.

I'm attaching patch 5, which adds a bunch of new test cases to various modules which demonstrate those modules correctly handling UTF-8-encoded URLs. It also fixes a bug in email.utils which I introduced in patch 2.

Note that I haven't yet fully investigated urllib.request. Aside from that, the only remaining matter is whether or not it's better to encode URLs as UTF-8 or Latin-1 by default, and I'm pretty sure that question doesn't need debate.

So basically I think if there's support for it, this patch is just about ready to be accepted. I'm hoping it can be included in the 3.0b2 release next week.

I'd be glad to hear any feedback about this proposal.

Not Yet Investigated
--------------------

./urllib/request.py
    By far the biggest user of quote and unquote. username, password, hostname and paths are now all converted to/from UTF-8 percent-encodings. Other concerns are:
    * Data in the form application/x-www-form-urlencoded
    * FTP access
    I think this needs to be tested further.

Looks fine, not tested
----------------------

./xmlrpc/client.py
    Just used to decode URI auth string (user:pass). This will change to UTF-8, but is probably OK.

./logging/handlers.py
    Just uses it in the HTTP handler to encode a dictionary. Probably preferable to use UTF-8 to encode an arbitrary string.

./macurl2path.py
    Calls to urllib look broken. Not tested.
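For the untested callers listed above, the change comes down to which octets a non-ASCII character maps to. The following is only a rough sketch, assuming urllib.parse as it behaves in current Python 3 releases (UTF-8 by default, with an optional "encoding" argument). The record dictionary and the auth string are made-up illustrations and are not taken from logging.handlers or xmlrpc.client.

```python
from urllib.parse import unquote, urlencode

# Hypothetical log record fields, similar in spirit to what an HTTP
# logging handler might send as a form-encoded body.
record = {"name": "app", "msg": "café"}

utf8_body = urlencode(record)                        # UTF-8 by default
latin1_body = urlencode(record, encoding="latin-1")  # the older interpretation

# Hypothetical "user:pass" auth string from a URL, percent-encoded as UTF-8.
auth = "jos%C3%A9:s%C3%A9same"
user, password = (unquote(part) for part in auth.split(":", 1))

print(utf8_body)
print(latin1_body)
print(user, password)
```

The two bodies differ only in the octets used for "é", which is why a server that assumes one encoding will misread data produced under the other.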
Tested manually, fine
---------------------

./wsgiref/simple_server.py
    Just used to set PATH_INFO, fine if URLs are UTF-8 encoded.

./http/server.py
    All uses are for translating between actual file-system paths and URLs. This works fine for UTF-8 URLs. Note that since it uses quote to create URLs in a dir listing, and unquote to handle them, it breaks when unquote is not the inverse of quote.

    Consider the following simple script:

        import http.server
        # Serve the current directory on port 8000.
        s = http.server.HTTPServer(('',8000),
                http.server.SimpleHTTPRequestHandler)
        s.serve_forever()

    This will "kind of" work in the unpatched version, using Latin-1 URLs, but filenames with characters outside range(0,256) will break (give a 404 error). The patch fixes this.

./urllib/robotparser.py
    No test cases. Manually tested, URLs properly match when percent-encoded in UTF-8.

./nturl2path.py
    No test cases available. Manually tested, fine if URLs are UTF-8 encoded.

Test cases either exist or added, fine
--------------------------------------

./test/test_urllib.py
    I wrote a large wad of test cases for all the new functionality.

./wsgiref/util.py
    Added test cases expecting UTF-8.

./http/cookiejar.py
    I changed a test case to expect UTF-8.

./email/utils.py
    I changed this file to behave as it used to, to satisfy its existing test cases.

./cgi.py
    Added test cases for UTF-8-encoded query strings.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the decoding of percent-encoded octets. As per RFC 3986, default is "utf-8" (previously implicitly decoded as ISO-8859-1).

urllib.parse.quote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the encoding of non-ASCII characters before being percent-encoded. Default is "utf-8" (previously characters in range(128, 256) were encoded as ISO-8859-1, and characters above that as UTF-8). Also, non-ASCII characters (128 and above) are no longer allowed to be "safe".
Doc/library/urllib.parse.rst: Updated docs on quote and unquote to reflect new interface.

Lib/test/test_urllib.py: Added several new test cases testing encoding and decoding Unicode strings with various encodings. This includes updating one test case to now expect UTF-8 by default.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py, Lib/test/test_wsgiref.py: Updated and added test cases to deal with UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with encoding="latin-1", to preserve existing behaviour (which the whole email module is dependent upon).

Added file: http://bugs.python.org/file10888/parse.py.patch5

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________
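The quote/unquote interface described in the commit log can be sketched roughly as follows. This assumes the behaviour of urllib.parse in current Python 3 releases, where the "encoding" and "errors" arguments are available. The file name is an arbitrary example, not one from the test suite.

```python
from urllib.parse import quote, unquote

name = "日本語 café.txt"  # arbitrary example with characters beyond Latin-1

# Default: characters are encoded as UTF-8 before percent-encoding,
# so unquote is the inverse of quote for any string.
encoded = quote(name)
roundtrip = unquote(encoded)

# Explicit Latin-1, as the commit log says email.utils requests.
latin1 = quote("café", encoding="latin-1")
latin1_back = unquote(latin1, encoding="latin-1")

# A lone Latin-1 octet is not valid UTF-8; "errors" controls the outcome.
replaced = unquote(latin1, errors="replace")
try:
    unquote(latin1, errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(encoded)
print(latin1, latin1_back, repr(replaced), strict_failed)
```

The round trip on the default path is the property http.server relies on when it quotes names for a directory listing and unquotes them again on request.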