[issue3300] urllib.quote and unquote - Unicode issues

Matt Giuca Sat, 09 Aug 2008 20:10:11 -0700

Matt Giuca <[EMAIL PROTECTED]> added the comment:

> Bill's main concern is with a policy decision; I doubt he would
> object to using your code once that is resolved.


But his patch does the same basic operations as mine, just implemented
differently and with the heap of issues I outlined above. So it doesn't
have anything to do with the policy decision.

> The purpose of the quoting functions is to turn a string
> (representing the human-readable version) into bytes (that go
> over the wire).

Ah hang on, that's a misunderstanding. There is a two-step process involved.

Step 1. Translate <character/byte> string into an ASCII character string
by percent-encoding the <characters/bytes>. (If percent-encoding
characters, first encode them to octets using an unspecified encoding.)
Step 2. Serialize the ASCII character string into an octet sequence to
send it over the wire, using some unspecified encoding.
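
A minimal sketch of the two steps, assuming a current Python 3
interpreter and the urllib.parse.quote signature it ships with (which
postdates this message). The variable names and the choice of UTF-8 are
only for illustration:

```python
from urllib.parse import quote

# Step 1: percent-encode. The input is a character string; UTF-8 is this
# example's assumption, since the RFC leaves the encoding unspecified.
text = "br\u00fcckner"                    # 'brückner'
uri_text = quote(text, encoding="utf-8")  # still a str, ASCII characters only

# Step 2: serialize the ASCII character string into octets for the wire.
wire_octets = uri_text.encode("ascii")

print(type(uri_text).__name__, uri_text)
print(type(wire_octets).__name__, wire_octets)
```

The result of step 1 is a character string; octets only appear in step 2.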

Step 1 is explained in detail throughout the RFC (RFC 3986), particularly
in Section 1.2.1 Transcription ("Percent-encoded octets may be used within
a URI to represent characters outside the range of the US-ASCII coded
character set") and Section 2.1 Percent-Encoding.

Step 2 is not actually part of the spec (because the spec outlines URIs
as character sequences, not how to send them over a network). It is
briefly described in Section 2 ("This specification does not mandate any
particular character encoding for mapping between URI characters and the
octets used to store or transmit those characters.  When a URI appears
in a protocol element, the character encoding is defined by that protocol").

Section 1.2.1:

> A URI may be represented in a variety of ways; e.g., ink on
> paper, pixels on a screen, or a sequence of character
> encoding octets.  The interpretation of a URI depends only on
> the characters used and not on how those characters are
> represented in a network protocol.

The RFC then goes on to describe a scenario of writing a URI down on a
napkin, before stating:

> A URI is a sequence of characters that is not always represented
> as a sequence of octets.

Right, so there is no debate that a URI (after percent-encoding) is a
character string, not a byte string. The debate is only whether it's a
character or byte string before percent-encoding.

Therefore, the concept of "quote_as_bytes" is flawed.

> You feel wire-protocol bytes should be treated as
> strings, if only as bytestrings, because the libraries use them
> that way.

No, I do not. URIs post-encoding are character strings, in the Unicode
sense of the term "character". This entire topic has nothing to do with
the wire.

Note that the "charset" parameter in Bill's patch and the "encoding"
parameter in mine are not the mapping from URI strings to octets (that's
trivially ASCII). They name the charset used to encode character
information into octets, which then get percent-encoded.
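
To illustrate, here is a small sketch assuming the encoding parameter
that urllib.parse.quote has in current Python 3 (not necessarily the
exact form in either patch). The parameter changes which octets get
percent-encoded, while the output stays plain ASCII either way:

```python
from urllib.parse import quote

name = "\u00fc"  # 'ü'

# The encoding decides which octets represent the character before
# percent-encoding is applied.
as_utf8 = quote(name, encoding="utf-8")      # two octets: C3 BC
as_latin1 = quote(name, encoding="latin-1")  # one octet: FC

print(as_utf8, as_latin1)

# Both results are character strings that serialize trivially as ASCII.
print(as_utf8.encode("ascii"), as_latin1.encode("ascii"))
```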

> The old code (and test cases) assumed Latin-1.

No, the old code and test cases were written for Python 2.x. They
assumed a byte string was being emitted (back when a byte string was a
string, so that was an acceptable output type). So they weren't assuming
an encoding. In fact the *ONLY* test case for Unicode in test_urllib
used a UTF-8-encoded string.

> r = urllib.parse.unquote('br%C3%BCckner_sapporo_20050930.doc')
> self.assertEqual(r, 'br\xc3\xbcckner_sapporo_20050930.doc')

In Python 2.x, this test case says "unquote('%C3%BC') should give me the
byte sequence '\xc3\xbc'", which is a valid case. In Python 3.0, the
code didn't change but the meaning subtly did. Now it says
"unquote('%C3%BC') should give the string 'Ã¼'". The name is clearly
supposed to be "brückner", not "brÃ¼ckner", which means in Python 3.0 we
should EITHER be expecting the BYTE string b'\xc3\xbc' or the character
string 'ü'.

So the old code and test cases didn't assume any encoding; they were
accidentally made to assume Latin-1 when the language changed
underneath them.
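
The three readings of that test case can be sketched as follows,
assuming the unquote and unquote_to_bytes functions found in
urllib.parse in current Python 3 (they postdate this message, so treat
this as an illustration rather than the behaviour of the code under
discussion):

```python
from urllib.parse import unquote, unquote_to_bytes

quoted = "br%C3%BCckner_sapporo_20050930.doc"

# Reading 1: the result is a byte string, as in Python 2.x.
as_bytes = unquote_to_bytes(quoted)

# Reading 2: the octets are decoded as UTF-8, giving the intended name.
as_utf8 = unquote(quoted, encoding="utf-8")

# Reading 3: the octets are decoded as Latin-1, which is what the
# unchanged Python 3.0 test case accidentally asserted.
as_latin1 = unquote(quoted, encoding="latin-1")

print(as_bytes)
print(as_utf8)
print(as_latin1)
```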

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________