[issue1712522] urllib.quote throws exception on Unicode URL

Matt Giuca Mon, 19 Jul 2010 06:00:51 -0700

Matt Giuca <matt.gi...@gmail.com> added the comment:

> I think everyone assumed that the parameter should be a "str" object
> and nothing else. Apparently some people used it accidentally with
> some unicode strings and it "worked" until these strings contained
> non-ASCII characters.


I don't consider use of Unicode strings in Python 2.7 to be "accidental". In my 
experience with Python 2, pretty much everything already works with Unicode 
strings, and it's best practice to use them.

Now, one of the major goals of Python 2.6/2.7 is to allow writing code that 
ports smoothly to Python 3, and Unicode support is a major issue here. To 
quote "What's new in Python 3" (http://docs.python.org/py3k/whatsnew/3.0.html):
"To be prepared in Python 2.x, start using unicode for all unencoded text, and 
str for binary or encoded data only. Then the 2to3 tool will do most of the 
work for you."
Having functions in Python 2.7 which don't accept Unicode (or worse, raise 
exceptions only once the input contains non-ASCII characters) runs against 
best practices for moving to Python 3.

> If we were following you, we would add "encoding" and "errors" arguments
> to any str-accepting 2.x function, so that it can also accept unicode
> strings. That's certainly not a reasonable solution.

No, that's certainly not necessary. You don't need an "encoding" or "errors" 
argument on any given function in order to support Unicode. In fact, most code 
written to work with strings naturally works with Unicode, because Unicode 
strings support the same basic operations.
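
A minimal sketch of that point, written for Python 3, where bytes and str 
stand in for 2.x's str and unicode (an assumption made so the example runs on 
a current interpreter). The helper name normalize_token is hypothetical:

```python
def normalize_token(s):
    """Trim whitespace and lowercase a token.

    Uses only methods that both string types provide, so no
    "encoding" or "errors" argument is needed.
    """
    return s.strip().lower()


# Works unchanged on text and on byte strings.
text_result = normalize_token("  Hello ")
bytes_result = normalize_token(b"  Hello ")
print(text_result, bytes_result)
```

The function never converts between the two types, which is why it has no 
reason to ask about an encoding.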

The need for "encoding" and "errors" arguments, and in fact the need to deal 
with string encoding at all in urllib.quote, is due to the special nature of 
URLs. If URLs had a syntax like %uXXXX, there would be no need to encode 
Unicode strings (e.g. as UTF-8) at all. However, because the RFC specifies that 
Unicode strings are to be encoded into a byte sequence *using an unspecified 
encoding*, it is necessary, for this specific function, to ask the 
programmer which encoding to use.
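
To illustrate why the choice of encoding matters, here is a small sketch using 
Python 3's urllib.parse.quote, which takes "encoding" and "errors" arguments. 
It is used here only as a stand-in for the 2.7 function under discussion, and 
the sample string is arbitrary:

```python
from urllib.parse import quote

text = "caf\u00e9"  # "café": ends in one non-ASCII character

# The same character becomes different bytes, and so different
# percent-escapes, depending on the encoding chosen.
utf8_form = quote(text, encoding="utf-8")      # é -> bytes C3 A9
latin1_form = quote(text, encoding="latin-1")  # é -> byte E9

# "errors" decides what happens when the encoding cannot represent
# a character; "replace" substitutes "?", which is then escaped.
ascii_form = quote(text, encoding="ascii", errors="replace")

print(utf8_form)    # caf%C3%A9
print(latin1_form)  # caf%E9
print(ascii_form)   # caf%3F
```

All three outputs are valid percent-encoded strings, but a server will decode 
them differently, so the function cannot pick one on the caller's behalf 
without at least documenting a default.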

Thus I assure you, this is not just one random function I have picked to add 
these arguments to. This is the only one (that I know of) that requires them to 
support Unicode.

> The original issue is against robotparser, and clearly states a bug
> (robotparser doesn't work in some cases).

I don't know why this keeps coming back to robotparser. The original bug was 
not against robotparser; it is called "quote throws exception on Unicode URL" 
and that is the bug. Robotparser was just one demonstrative piece of code which 
failed because of it.

Having said that, I don't expect to continue this argument. If you (the Python 
developers) decide that it's too late to accept this, then I won't object to 
reverting it.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue1712522>
_______________________________________
