Matt Giuca <[EMAIL PROTECTED]> added the comment:

Bill, this debate is getting snippy and going nowhere. We could argue about what is the "pure" and "correct" thing to do, but we have a limited time frame here, so I suggest we just look at the important facts.

1. There is an overwhelming consensus (including from me) that a str->bytes version is acceptable to have in the library (whether or not it's the "correct solution").
2. There is an overwhelming consensus (including from you) that a str->str version is acceptable to have in the library (whether or not it's the "correct solution").
3. By default, the str->str version breaks much less code, so both of us decided to use it by default.

To this end, both of our patches:

1. Have a str->bytes version available.
2. Have a str->str version available.
3. Have "quote" and "unquote" functions call the str->str version.

So it seems we have agreed on that. Therefore, there should be no more arguing about which is "more right".

So all your arguments seem to be essentially saying "the str->bytes methods work perfectly; I don't care about if the str->str methods are correct or not". The fact that your string versions quote UTF-8 and unquote Latin-1 shows just how un-seriously you take the str->str methods.

Well the fact is that a) a great many users do NOT SHARE your ideals and will default to using "quote" and "unquote" rather than the bytes functions, and b) all of the rest of the library uses "quote" and "unquote". So from a practical sense, how these methods behave is of the utmost importance - they are more important than any new functions we introduce at this point.

For example, the cgi.FieldStorage and the http.server modules will implicitly call unquote and quote.

That means whether you, or I, or Guido, or The King Of The Internet likes it or not, we have to have a "most reasonable" solution to the problem of quoting and unquoting strings.

> Good thing we don't need to [handle unescaped non-ASCII characters in
> unquote]; URIs consist of ASCII characters.

Once again, practicality beats purity. I'd argue that it's a *good* (not strictly required) idea to not mangle input unless we have to.

> > * Question: How does unquote_bytes deal with unescaped characters?
> Not sure I understand this question...

I meant unescaped non-ASCII characters, as discussed above (e.g. unquote_bytes('\u0123')).

> Your test cases probably aren't testing things I feel it's necessary
> to test. I'm interested in having the old test cases for urllib
> pass, as well as providing the ability to unquote_to_bytes().

I'm sorry, but you're missing the point of test-driven development. If you think there is a bug, you don't just fix it and say "look, the old test cases still pass!" You write new FAILING test cases to demonstrate the bug. Then you change the code to make the test cases pass. All your test suite proves is that you're happy with things the way they are.

> Matt, your patch is not some God-given thing here.

No, I am merely suggesting that it's had a great deal more thought put into it -- not just my thought, but all the other people in the past month who've suggested different approaches and brought up discussion points. Including yourself -- it was your suggestion in the first place to have the str->bytes functions, which I agree are important.

> > <snip> - Quote uses cache
> I see no real advantage there, except that it has a built-in
> memory leak. Just use a function.

Good point. Well, the merits of using a cache are completely independent of the behavioural aspects. I simply changed the existing code as little as possible. Hence this patch will have the same performance strengths/weaknesses as all previous versions, and the performance can be tuned after 3.0 if necessary. (Not urgent).

On statistics about UTF-8 versus other encodings. Yes, I agree, there are lots of URIs floating around out there, in many different encodings. Unfortunately, we can't implicitly handle them all (and I'm talking once more explicitly about the str->str transform here). We need to pick one as the default. Whether Latin-1 is more popular than UTF-8 *for the time being* is no good reason to pick Latin-1. It is called a "legacy encoding" for a reason.
It is being phased out and should NOT be supported from here on in as the default encoding in a major web programming language. (Also there is no point in claiming to be "Unicode compliant" then turning around and supporting a charset with 256 symbols by default).

Because Python's urllib will mostly be used in the context of building web apps, it is up to the programmer to decide what encoding to use for h(is|er) web app. For future apps, this should almost certainly be UTF-8 (if it isn't, the website won't be able to accept form input across all characters, so isn't Unicode compliant anyway).

The problem you mention of browsers submitting URIs encoded based on the charset is simply something we have to live with. A server will never be able to deal with that unless the URIs are coming from pages which *it served*. As this is very often the case, then as I said above, the app should serve UTF-8 and there'll be no problems.

Also note that ALL the browsers I tested (FF/Saf/IE) use UTF-8 no matter what, if you directly type Unicode characters into the address bar.

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________
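
As a footnote to the point above about quoting as UTF-8 and unquoting as Latin-1, here is a minimal sketch of why the two directions of the str->str transform need to agree on an encoding. It assumes the urllib.parse functions as they exist in released Python 3 (quote, unquote, unquote_to_bytes), which are not necessarily identical to either patch discussed in this thread. The sample string is made up purely for illustration.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Hypothetical sample input: 'cafe' with an acute accent on the final letter.
original = "caf\u00e9"

# str -> str quoting: the text is encoded (UTF-8 by default) and the
# resulting non-ASCII bytes are percent-escaped.
quoted = quote(original)

# Unquoting with the same encoding round-trips cleanly.
same_encoding = unquote(quoted, encoding="utf-8")

# Unquoting with a different encoding silently yields different text:
# the two UTF-8 bytes are read back as two separate Latin-1 characters.
mismatched = unquote(quoted, encoding="latin-1")

# The str -> bytes variant sidesteps the question by not decoding at all.
raw = unquote_to_bytes(quoted)

# An unescaped non-ASCII character contains no '%', so the str -> str
# unquote has nothing to decode and passes it through untouched.
passthrough = unquote("\u0123")

print(quoted)
print(same_encoding == original)
print(mismatched == original)
print(raw)
```

The mismatched case is the one that matters for callers who only ever use "quote" and "unquote": nothing raises, the text just comes back wrong.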