[issue3300] urllib.quote and unquote - Unicode issues

Matt Giuca Thu, 07 Aug 2008 08:00:13 -0700

Matt Giuca <[EMAIL PROTECTED]> added the comment:

Following Guido and Antoine's reviews, I've written a new patch which
fixes *most* of the issues raised. The ones I didn't fix I have noted
below, and commented on the review site
(http://codereview.appspot.com/2827/). Note: I intend to address all of
these issues after some discussion.


Outstanding issues raised by the reviews:

Doc/library/urllib.parse.rst:
Should unquote accept a bytes/bytearray as well as a str?

Lib/email/utils.py:
Should encode_rfc2231 with charset=None accept strings with non-ASCII
characters, and just encode them to UTF-8?

Lib/test/test_http_cookiejar.py:
Does RFC 2965 let me get away with changing the test case to expect
UTF-8? (I'm pretty sure it doesn't care what encoding is used).

Lib/test/test_urllib.py:
Should quote raise a TypeError if given a bytes with encoding/errors
arguments? (Motivation: TypeError is what you usually raise if you
supply too many args to a function).

Lib/urllib/parse.py:
(As discussed above) Should quote accept safe characters outside the
ASCII range (thereby potentially producing invalid URIs)?

------

Commit log for patch8:

Fix for issue 3300.

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1). Also fixed a bug in which mixed-case hex digits (such as
"%aF") weren't being decoded at all.
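A minimal sketch of the unquote behaviour described above. It assumes the interface as it exists in current Python 3 (where the default for "errors" is "replace"); the details of patch8 itself may differ.

```python
from urllib.parse import unquote

# "%C3%A9" is the UTF-8 encoding of U+00E9; UTF-8 is the default.
utf8_default = unquote("caf%C3%A9")

# The old ISO-8859-1 interpretation must now be requested explicitly.
latin1 = unquote("caf%E9", encoding="latin-1")

# A lone %E9 is not valid UTF-8; with errors="replace" it becomes U+FFFD.
replaced = unquote("caf%E9")

# Mixed-case hex digits are decoded like any other escape.
mixed_case = unquote("%aF", encoding="latin-1")

print(utf8_default, latin1, repr(replaced), repr(mixed_case))
```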

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Characters/bytes outside the ASCII range (128 and above) are
no longer allowed to be "safe". quote now also accepts either bytes or
strings.
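A short illustration of the quote side, again assuming the interface as found in current Python 3 rather than the exact state of patch8:

```python
from urllib.parse import quote

# Non-ASCII characters are encoded as UTF-8 before percent-encoding.
utf8_default = quote("caf\u00e9")

# The same character encoded as a single ISO-8859-1 octet.
latin1 = quote("caf\u00e9", encoding="latin-1")

# Bytes input is percent-encoded octet by octet; no encoding is involved.
from_bytes = quote(b"caf\xe9")

# "/" is safe by default; pass safe="" to escape it as well.
path_default = quote("a/b c")
path_strict = quote("a/b c", safe="")

print(utf8_default, latin1, from_bytes, path_default, path_strict)
```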

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes. All quote/unquote functions now exported
from the module.
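The two bytes-oriented functions are useful when the octets are not text in any particular encoding. A minimal round-trip sketch, assuming the signatures in current Python 3 (the sample payload is made up for illustration):

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Arbitrary binary data, not valid UTF-8 (hypothetical sample payload).
raw = b"\xff\x00abc/"

# Every octet outside the unreserved/safe set becomes %XX; "/" is safe by default.
quoted = quote_from_bytes(raw)

# Decoding to bytes recovers the original octets without any charset guess.
restored = unquote_to_bytes(quoted)

print(quoted, restored)
```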

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added many new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).
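To see why encoding="latin-1" preserves the old behaviour, note that ISO-8859-1 maps each code point below 256 to exactly one octet, so quoting and unquoting round-trip losslessly. The sketch below calls urllib.parse directly and is not the code in Lib/email/utils.py:

```python
from urllib.parse import quote, unquote

# Every code point representable in ISO-8859-1.
text = "".join(chr(i) for i in range(256))

# safe="" so that "/" is escaped too; the result is pure ASCII.
encoded = quote(text, safe="", encoding="latin-1")
decoded = unquote(encoded, encoding="latin-1")

print(decoded == text)
```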

Added file: http://bugs.python.org/file11069/parse.py.patch8

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________