[issue3300] urllib.quote and unquote - Unicode issues

Matt Giuca Sun, 10 Aug 2008 00:06:03 -0700

Matt Giuca <[EMAIL PROTECTED]> added the comment:

Guido suggested that quote's "safe" parameter should allow any
character, not just those in the ASCII range. I've implemented this now.
It was a lot messier than I imagined.


The problem is that in my older patches, both 's' and 'safe' are encoded
to bytes right away, and the rest of the process is just octet encoding
(matching each byte against the safe set to see whether or not to quote it).
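For illustration, here is a minimal sketch of that byte-level approach. It is not the code from the older patches: the name quote_bytewise and the ALWAYS_SAFE set are assumptions made for this example, and 'safe' is assumed to be ASCII-only.

```python
# Hypothetical sketch of the "encode first, then quote bytes" approach.
# ALWAYS_SAFE and quote_bytewise are illustrative names, not from the patch.
ALWAYS_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789_.-"
)


def quote_bytewise(s, safe="/", encoding="utf-8", errors="strict"):
    """Encode 's' and 'safe' to bytes up front, then quote octet by octet."""
    data = s.encode(encoding, errors)
    # Assumption: 'safe' may only contain ASCII characters in this scheme.
    safe_bytes = ALWAYS_SAFE | frozenset(safe.encode("ascii"))
    # Each byte is either passed through or written as %XX.
    return "".join(
        chr(b) if b in safe_bytes else "%{:02X}".format(b)
        for b in data
    )
```

Because everything is bytes after the first two lines, there is only one membership test per octet, which is what keeps this version simple.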

The new implementation requires that you delay encoding both of these
until the iteration over the string, so you match each *character*
against the safe set, then encode it if it's not in 'safe'. The catch is
that some encoding/errors combinations turn a single character into
several bytes, some of which are in the safe range and must not be
quoted. For instance quote('\u6f22', encoding='latin-1',
errors='xmlcharrefreplace') should give "%26%2328450%3B" (which is
"&#28450;" encoded). To preserve this behaviour, you then have to check
each *byte* of the encoded character against a 'safe bytes' set. I
believe that will slow down the implementation considerably.
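A rough sketch of the two checks described above follows. It is not the attached patch: quote_two_level, ALWAYS_SAFE, and the rule that the 'safe bytes' set is the always-safe octets plus the ASCII members of 'safe' are all assumptions made for this example.

```python
# Hypothetical sketch of character-level matching followed by byte-level
# matching. Names and the exact safe-bytes rule are assumptions.
ALWAYS_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789_.-"
)


def quote_two_level(s, safe="/", encoding="utf-8", errors="strict"):
    safe_chars = frozenset(safe)
    # Assumption: only the ASCII members of 'safe' count as safe bytes.
    safe_bytes = ALWAYS_SAFE | frozenset(ord(c) for c in safe if ord(c) < 128)
    out = []
    for ch in s:
        # Level 1: a character listed in 'safe' passes through untouched,
        # even if it is outside the ASCII range.
        if ch in safe_chars:
            out.append(ch)
            continue
        # Level 2: encode the character, then check every resulting byte,
        # because the error handler may have produced safe bytes.
        for b in ch.encode(encoding, errors):
            out.append(chr(b) if b in safe_bytes else "%{:02X}".format(b))
    return "".join(out)
```

With this sketch, quote_two_level('\u6f22', encoding='latin-1', errors='xmlcharrefreplace') still yields "%26%2328450%3B", while passing safe='/\u6f22' leaves the character unquoted. The per-character encode call inside the loop is where the extra cost comes from.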

In summary, it requires two levels of encoding: first characters, then
bytes. You can see how messy it made my quote implementation - I've
attached the patch (parse.py.patch8+allsafe).

I don't think it's worth the extra code bloat and performance hit just
to implement a feature whose only use is producing invalid URIs (since
URIs are supposed to contain only ASCII characters). Does anyone
disagree, and want this feature in?

Added file: http://bugs.python.org/file11092/parse.py.patch8+allsafe

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________