Matt Giuca <[EMAIL PROTECTED]> added the comment:

OK I've gone back over the patch and decided to add the "encoding" and "errors" arguments from the str.encode/decode methods as optional arguments to quote and unquote. This is a much bigger change than I originally intended, but I think it makes things much better because we'll get UTF-8 by default (which as far as I can tell is by far the most common encoding).
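To illustrate what the two arguments control, here is a minimal sketch. It assumes a urllib.parse in which quote and unquote accept "encoding" and "errors" as described in this comment; both are passed explicitly so the example does not depend on the defaults. The sample string is made up for the illustration.

```python
from urllib.parse import quote, unquote

# Sample text with one non-ASCII character (U+00E9).
text = "caf\u00e9"

# The text is encoded to octets first, then each octet is percent-encoded,
# so the chosen encoding decides what ends up in the URI.
utf8_quoted = quote(text, encoding="utf-8", errors="replace")
latin1_quoted = quote(text, encoding="latin-1", errors="replace")

# Decoding with the same encoding round-trips the text.
utf8_roundtrip = unquote(utf8_quoted, encoding="utf-8", errors="replace")
latin1_roundtrip = unquote(latin1_quoted, encoding="latin-1", errors="replace")

# Decoding Latin-1 octets as UTF-8 fails; errors="replace" substitutes
# U+FFFD instead of raising.
mismatched = unquote(latin1_quoted, encoding="utf-8", errors="replace")

print(utf8_quoted, latin1_quoted, repr(mismatched))
```

The mismatched case is why the choice of default matters: a caller that quoted with one encoding and unquoted with another gets replacement characters rather than the original text.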
(Tom Pinckney just made the same suggestion right as I'm typing this up!)

So my new patch is a bit more extensive, and changes the interface (in a backwards-compatible way). Both quote and unquote now support "encoding" and "errors" arguments, defaulting to "utf-8" and "replace", respectively.

Implementation detail: This changes the Quoter class a lot; it now hashes four fields to ensure it doesn't use the wrong cache.

Also fixed an issue with the previous patch where non-ASCII-compatible encodings broke for code points < 128.

I then ran the full test suite and discovered that test cases in two other modules broke. I've fixed them so the full suite passes, but I suspect there may be more issues (see below).

* Lib/test/test_http_cookiejar.py: A test case was written explicitly expecting Latin-1 encoding. I've changed this test case to expect UTF-8.
* Lib/email/utils.py: I extensively analysed this code and discovered that it kind of "cheats" - it uses the Latin-1 encoding and treats it as octets, then applies its own encoding scheme. So to fix this, I changed the email module to call quote and unquote with encoding="latin-1". Hence it has the same behaviour as before.

Some potential issues:

* I have not updated the documentation yet. If this idea is to go ahead, the docs will need to show these new optional arguments. (I'll do that myself but haven't yet).
* While the full test suite passes, I'm sure there will be many more issues since I've changed the interface. Therefore I don't recommend that this patch be accepted just yet. I plan to investigate all uses (within the standard lib) of quote and unquote to see if there are any other compatibility issues, particularly within urllib. Hence I'll respond to this again in a few days.
* The new patch to the "safe" argument of quote allows non-ASCII characters to be made safe. This correspondingly allows the construction of URIs with non-ASCII characters. Is it better to allow users to do this if they really want, or just mysteriously fail to let those characters through?

I would also like to have a separate pair of functions, unquote_raw and quote_raw, which work on bytes objects instead of strings. (unquote_raw would take a str and produce a bytes, while quote_raw would take a bytes and produce a str). As URI encoding is fundamentally an octet encoding, not a character encoding, this is the only way to do URI encoding without choosing a Unicode character encoding. (I see some modules such as "email" treating the implicit Latin-1 encoding as byte encoding, which is a bit dodgy - they could benefit from raw functions). But as that requires further changes to the interface, I'll save it for another day.

Patch (parse.py.patch2) is for branch /branches/py3k, revision 64820.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the decoding of percent-encoded octets (previously implicitly decoded as ISO-8859-1). As per RFC 3986, default is "utf-8".

urllib.parse.quote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the encoding of non-ASCII characters before being percent-encoded (previously characters in range(128, 256) were encoded as ISO-8859-1, and characters above that as UTF-8). Also fixed characters greater than 256 not responding to "safe", and also not being cached.

Lib/test/test_urllib.py, Lib/test/test_http_cookiejar.py: Updated test cases which expected output in ISO-8859-1, now expects UTF-8.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with encoding="latin-1", to preserve existing behaviour (which the whole email module is dependent upon).
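For reference, the quote_raw/unquote_raw pair mentioned above could look roughly like the following. This is only a sketch and is not part of parse.py.patch2: the function bodies, the default safe=b"/" (chosen here to mirror quote), and the handling of malformed escapes are all assumptions made for illustration.

```python
import re

# Unreserved characters from RFC 3986; left unescaped in this sketch.
_UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)

# Matches one well-formed %XX escape.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def quote_raw(data: bytes, safe: bytes = b"/") -> str:
    """Percent-encode octets directly; no character encoding is involved."""
    keep = _UNRESERVED | frozenset(safe)
    return "".join(chr(b) if b in keep else "%{:02X}".format(b) for b in data)


def unquote_raw(text: str) -> bytes:
    """Turn %XX escapes back into octets.

    Characters outside escapes must be ASCII (otherwise UnicodeEncodeError
    is raised); a malformed escape such as "%zz" is passed through as-is.
    """
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        out += text[pos:match.start()].encode("ascii")
        out.append(int(match.group(1), 16))
        pos = match.end()
    out += text[pos:].encode("ascii")
    return bytes(out)


# Every octet value survives a round trip without picking an encoding.
all_octets = bytes(range(256))
roundtrip = unquote_raw(quote_raw(all_octets))
```

With functions of this shape, a module like email could pass its octets straight through instead of going via Latin-1 strings.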
Added file: http://bugs.python.org/file10870/parse.py.patch2

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________