Matt Giuca <[EMAIL PROTECTED]> added the comment:

> Bill's main concern is with a policy decision; I doubt he would
> object to using your code once that is resolved.

But his patch does the same basic operations as mine, just implemented differently and with the heap of issues I outlined above. So it doesn't have anything to do with the policy decision.

> The purpose of the quoting functions is to turn a string
> (representing the human-readable version) into bytes (that go
> over the wire).

Ah hang on, that's a misunderstanding. There is a two-step process involved.

Step 1. Translate the <character/byte> string into an ASCII character string by percent-encoding the <characters/bytes>. (If percent-encoding characters, use an unspecified encoding.)
Step 2. Serialize the ASCII character string into an octet sequence to send it over the wire, using some unspecified encoding.

Step 1 is explained in detail throughout the RFC, particularly in Section 1.2.1 Transcription ("Percent-encoded octets may be used within a URI to represent characters outside the range of the US-ASCII coded character set") and 2.1 Percent Encoding.

Step 2 is not actually part of the spec (because the spec outlines URIs as character sequences, not how to send them over a network). It is briefly described in Section 2 ("This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol").

Section 1.2.1:
> A URI may be represented in a variety of ways; e.g., ink on
> paper, pixels on a screen, or a sequence of character
> encoding octets. The interpretation of a URI depends only on
> the characters used and not on how those characters are
> represented in a network protocol.

The RFC then goes on to describe a scenario of writing a URI down on a napkin, before stating:
> A URI is a sequence of characters that is not always represented
> as a sequence of octets.

Right, so there is no debate that a URI (after percent-encoding) is a character string, not a byte string.

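
The two steps above can be sketched roughly as follows. This is only an illustration, assuming the urllib.parse.quote function as it exists in current Python 3 (with its "encoding" parameter), which is not necessarily identical to either patch under discussion. The sample string and the choice of UTF-8 and Latin-1 are my own assumptions; the RFC leaves the step 1 encoding unspecified.

```python
from urllib.parse import quote

text = "br\u00fcckner"  # the character string 'brückner' (illustrative input)

# Step 1: percent-encode. The encoding chooses which octets represent
# the non-ASCII characters before those octets are percent-encoded.
uri_utf8 = quote(text, encoding="utf-8")      # 'br%C3%BCckner'
uri_latin1 = quote(text, encoding="latin-1")  # 'br%FCckner'

# Either way, the result is a character string made only of ASCII characters.
assert isinstance(uri_utf8, str) and uri_utf8.isascii()

# Step 2: serialise the ASCII character string to octets for the wire.
# This is a separate step, and for an ASCII-only string it is trivial.
wire = uri_utf8.encode("ascii")               # b'br%C3%BCckner'
```

Note that the encoding argument only affects step 1; the step 2 serialisation never sees a non-ASCII character.
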
The debate is only whether it's a character or byte string before percent-encoding. Therefore, the concept of "quote_as_bytes" is flawed.

> You feel wire-protocol bytes should be treated as
> strings, if only as bytestrings, because the libraries use them
> that way.

No I do not. URIs post-encoding are character strings, in the Unicode sense of the term "character". This entire topic has nothing to do with the wire.

Note that the "charset" or "encoding" parameter (in Bill's patch and mine respectively) isn't the mapping from URI strings to octets (that's trivially ASCII). It's the charset used to encode character information into octets which then get percent-encoded.

> The old code (and test cases) assumed Latin-1.

No, the old code and test cases were written for Python 2.x. They assumed a byte string was being emitted (back when a byte string was a string, so that was an acceptable output type). So they weren't assuming an encoding. In fact the *ONLY* test case for Unicode in test_urllib used a UTF-8-encoded string.

> r = urllib.parse.unquote('br%C3%BCckner_sapporo_20050930.doc')
> self.assertEqual(r, 'br\xc3\xbcckner_sapporo_20050930.doc')

In Python 2.x, this test case says "unquote('%C3%BC') should give me the byte sequence '\xc3\xbc'", which is a valid case. In Python 3.0, the code didn't change but the meaning subtly did. Now it says "unquote('%C3%BC') should give the string 'Ã¼'". The name is clearly supposed to be "brückner", not "brÃ¼ckner", which means in Python 3.0 we should EITHER be expecting the BYTE string b'\xc3\xbc' or the character string 'ü'.

So the old code and test cases didn't assume any encoding; they were accidentally made to assume Latin-1 by the fact that the language changed underneath them.

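
To make the three readings of that test case concrete, here is a minimal sketch. It assumes urllib.parse as it exists in current Python 3: unquote_to_bytes and the "encoding" parameter of unquote come from the present-day standard library, not necessarily from either patch discussed here.

```python
from urllib.parse import unquote, unquote_to_bytes

quoted = "br%C3%BCckner_sapporo_20050930.doc"

# Reading 1: the result is a byte string (what the Python 2.x test meant).
raw = unquote_to_bytes(quoted)
assert raw == b"br\xc3\xbcckner_sapporo_20050930.doc"

# Reading 2: decode the octets as UTF-8, giving the intended name.
as_utf8 = unquote(quoted, encoding="utf-8")
assert as_utf8 == "br\u00fcckner_sapporo_20050930.doc"    # 'brückner...'

# Reading 3: decode the octets as Latin-1, which is what the unchanged
# test case accidentally came to assert under Python 3.0.
as_latin1 = unquote(quoted, encoding="latin-1")
assert as_latin1 == "br\xc3\xbcckner_sapporo_20050930.doc"  # 'brÃ¼ckner...'
```

Readings 1 and 2 are the two defensible expectations; reading 3 is the mojibake result.
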
_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________