[issue3300] urllib.quote and unquote - Unicode issues

Matt Giuca Tue, 12 Aug 2008 08:22:10 -0700

Matt Giuca <[EMAIL PROTECTED]> added the comment:

Bill, this debate is getting snippy and going nowhere. We could argue
about what is the "pure" and "correct" thing to do, but we have a
limited time frame here, so I suggest we just look at the important facts.


1. There is an overwhelming consensus (including from me) that a
str->bytes version is acceptable to have in the library (whether or not
it's the "correct solution").
2. There is an overwhelming consensus (including from you) that a
str->str version is acceptable to have in the library (whether or not
it's the "correct solution").
3. By default, the str->str version breaks much less code, so both of us
decided to use it by default.

To this end, both of our patches:

1. Have a str->bytes version available.
2. Have a str->str version available.
3. Have "quote" and "unquote" functions call the str->str version.

So it seems we have agreed on that. Therefore, there should be no more
arguing about which is "more right".

So all your arguments seem to be essentially saying "the str->bytes
methods work perfectly; I don't care whether the str->str methods are
correct or not". The fact that your string versions quote UTF-8 and
unquote Latin-1 shows just how un-seriously you take the str->str methods.
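To illustrate why quoting with one codec and unquoting with another matters, here is a minimal sketch. It uses the quote/unquote functions from urllib.parse as they exist in current Python 3, not either of the patches under discussion, and the sample string is my own.

```python
from urllib.parse import quote, unquote

original = "caf\u00e9"  # "café", a sample string chosen for illustration

# Percent-encode the UTF-8 bytes of the string.
quoted = quote(original, encoding="utf-8")

# Decode the same escapes with the matching codec and with a different one.
symmetric = unquote(quoted, encoding="utf-8")
mismatched = unquote(quoted, encoding="latin-1")

print(quoted)                   # caf%C3%A9
print(symmetric == original)    # True
print(mismatched == original)   # False: each UTF-8 byte became its own character
```

With mismatched codecs, unquote(quote(s)) is no longer the identity for any string containing non-ASCII characters.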

Well, the fact is that a) a great many users do NOT SHARE your ideals and
will default to using "quote" and "unquote" rather than the bytes
functions, and b) all of the rest of the library uses "quote" and
"unquote". So in practical terms, how these methods behave is of the
utmost importance - they are more important than any new functions we
introduce at this point.

For example, the cgi.FieldStorage and the http.server modules will
implicitly call unquote and quote.

That means whether you, or I, or Guido, or The King Of The Internet
likes it or not, we have to have a "most reasonable" solution to the
problem of quoting and unquoting strings.

> Good thing we don't need to [handle unescaped non-ASCII characters in
> unquote]; URIs consist of ASCII characters.

Once again, practicality beats purity. I'd argue that it's a *good* (not
strictly required) idea to not mangle input unless we have to.

> > * Question: How does unquote_bytes deal with unescaped characters?

> Not sure I understand this question...

I meant unescaped non-ASCII characters, as discussed above (e.g.,
unquote_bytes('\u0123')).
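For reference, this is how the question plays out in current Python 3, where the bytes function is named unquote_to_bytes. This is only a sketch of today's behaviour; the patches discussed here may have behaved differently.

```python
from urllib.parse import unquote, unquote_to_bytes

raw = "\u0123"  # one non-ASCII character and no percent-escapes

# str -> str: the unescaped character passes through untouched.
as_text = unquote(raw)

# str -> bytes: the unescaped character is encoded as UTF-8.
as_bytes = unquote_to_bytes(raw)

# Mixed input: an unescaped character followed by the escape for "A".
mixed = unquote_to_bytes("\u0123%41")

print(as_text == raw)   # True
print(as_bytes)         # b'\xc4\xa3'
print(mixed)            # b'\xc4\xa3A'
```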

> Your test cases probably aren't testing things I feel it's necessary
> to test. I'm interested in having the old test cases for urllib
> pass, as well as providing the ability to unquote_to_bytes().

I'm sorry, but you're missing the point of test-driven development. If
you think there is a bug, you don't just fix it and say "look, the old
test cases still pass!" You write new FAILING test cases to demonstrate
the bug. Then you change the code to make the test cases pass. All your
test suite proves is that you're happy with things the way they are.
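A test written in that spirit might look like the sketch below. The class name and sample strings are hypothetical and are not taken from either patch; the tests pass against urllib.parse in current Python 3, and the first one would fail against an implementation that quotes with UTF-8 but unquotes with Latin-1.

```python
import unittest
from urllib.parse import quote, unquote


class QuoteRoundTripTest(unittest.TestCase):
    """Pin down the behaviour we want before changing any code."""

    def test_non_ascii_round_trip(self):
        # unquote(quote(s)) should give back s for non-ASCII input.
        for text in ("\u00e9", "\u0123", "\u65e5\u672c"):
            with self.subTest(text=text):
                self.assertEqual(unquote(quote(text)), text)

    def test_unescaped_non_ascii_is_preserved(self):
        # Input that was never escaped should not be mangled.
        self.assertEqual(unquote("\u0123"), "\u0123")
```

Saved in a module, this runs with "python -m unittest".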

> Matt, your patch is not some God-given thing here.

No, I am merely suggesting that it's had a great deal more thought put
into it -- not just my thought, but all the other people in the past
month who've suggested different approaches and brought up discussion
points. Including yourself -- it was your suggestion in the first place
to have the str->bytes functions, which I agree are important.

> > <snip> - Quote uses cache

> I see no real advantage there, except that it has a built-in
> memory leak. Just use a function.

Good point. Well the merits of using a cache are completely independent
from the behavioural aspects. I simply changed the existing code as
little as possible. Hence this patch will have the same performance
strengths/weaknesses as all previous versions, and the performance can
be tuned after 3.0 if necessary. (Not urgent).

On statistics about UTF-8 versus other encodings. Yes, I agree, there
are lots of URIs floating around out there, in many different encodings.
Unfortunately, we can't implicitly handle them all (and I'm talking once
more explicitly about the str->str transform here). We need to pick one
as the default. Whether Latin-1 is more popular than UTF-8 *for the time
being* is no good reason to pick Latin-1. It is called a "legacy
encoding" for a reason. It is being phased out and should NOT be
supported from here on in as the default encoding in a major web
programming language.

(Also there is no point in claiming to be "Unicode compliant" then
turning around and supporting a charset with 256 symbols by default).

Because Python's urllib will mostly be used in the context of building
web apps, it is up to the programmer to decide what encoding to use for
h(is|er) web app. For future apps, this should almost certainly be UTF-8
(if it isn't, the website won't be able to accept form input across all
characters, so isn't Unicode compliant anyway).
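To make the 256-symbol limitation concrete, here is a small sketch using quote from urllib.parse in current Python 3 (not the patches in this thread); the sample text is my own.

```python
from urllib.parse import quote

text = "\u65e5\u672c"  # "日本", characters outside Latin-1

# UTF-8 can represent any Unicode string.
utf8_quoted = quote(text, encoding="utf-8")

# Latin-1 cannot, so with the default strict error handling this raises.
try:
    quote(text, encoding="latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(utf8_quoted)  # %E6%97%A5%E6%9C%AC
print(latin1_ok)    # False
```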

The problem you mention, of browsers submitting URIs encoded according to
the charset of the page they came from, is simply something we have to
live with. A server will never be
able to deal with that unless the URIs are coming from pages which *it
served*. As this is very often the case, then as I said above, the app
should serve UTF-8 and there'll be no problems. Also note that ALL the
browsers I tested (FF/Saf/IE) use UTF-8 no matter what, if you directly
type Unicode characters into the address bar.

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________