[issue3300] urllib.quote and unquote - Unicode issues

Guido van Rossum Tue, 12 Aug 2008 16:44:04 -0700

Guido van Rossum <[EMAIL PROTECTED]> added the comment:

> Matt Giuca <[EMAIL PROTECTED]> added the comment:
> By the way, what is the current status of this bug? Is anybody waiting
> on me to do anything? (Re: Patch 9)


I'll be reviewing it today or tomorrow. From looking at it briefly I
worry that the implementation is pretty slow -- a method call for each
character and a map() call sounds pretty bad.
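
To make the performance worry concrete: one common way to avoid a method call per character is to precompute a lookup table once and then index into it. The patch is not quoted in this message, so the following is only a hypothetical sketch of that alternative, not the code from patch 9. The names ALWAYS_SAFE, build_quote_table and quote_bytes are made up for illustration.

```python
# Hypothetical sketch; not the implementation under review.
ALWAYS_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789_.-"
)

def build_quote_table(safe=b"/"):
    """Return a 256-entry list mapping each byte value to its quoted form."""
    keep = ALWAYS_SAFE | frozenset(safe)
    return [chr(b) if b in keep else "%{:02X}".format(b) for b in range(256)]

def quote_bytes(data, table):
    """Quote a bytes object using one list lookup per byte."""
    return "".join([table[b] for b in data])

table = build_quote_table()
print(quote_bytes("a b/\u00e9".encode("utf-8"), table))
```

The table is built once per distinct safe set, so the per-byte work is a single index operation rather than a Python-level method call.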

> To recap my previous list of outstanding issues raised by the review:
>
>> Should unquote accept a bytes/bytearray as well as a str?
> Currently, does not. I think it's meaningless to do so (and how to
> handle >127 bytes, if so?)

The bytes > 127 would be translated as themselves; this follows
logically from how stuff is parsed -- %% and %FF are translated,
everything else is not. But I don't really care, I doubt there's a
need.
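
A minimal sketch of that rule, assuming bytes input were accepted: only %XX escapes are decoded and every other byte, including those above 127, passes through unchanged. The helper name unquote_bytes_sketch is hypothetical and is not part of the patch.

```python
import re

# Matches a percent sign followed by exactly two hex digits.
_PERCENT_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")

def unquote_bytes_sketch(data):
    """Decode %XX escapes in a bytes object; leave all other bytes alone."""
    return _PERCENT_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)

# The raw 0xE9 byte is kept as is; %20 and %FF are decoded.
print(unquote_bytes_sketch(b"caf\xe9%20%FF"))
```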

>> Lib/email/utils.py:
>> Should encode_rfc2231 with charset=None accept strings with non-ASCII
>> characters, and just encode them to UTF-8?
> Currently does. Suggestion to restrict to ASCII on the review tracker;
> simple fix.

I think I agree with that comment; it seems wrong to return UTF-8
without declaring that charset in the header. The alternative would be
to default charset to utf-8 if there are any non-ASCII chars in the
input. I'd be okay with that too.
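
The alternative could look roughly like the wrapper below. This is a sketch under assumptions: encode_param_sketch is a hypothetical name, str.isascii() needs Python 3.7 or later, and the outputs depend on the email.utils.encode_rfc2231 that shipped in released Python 3, which may differ from the version in the patch.

```python
from email.utils import encode_rfc2231

def encode_param_sketch(value, charset=None):
    """Hypothetical wrapper: fall back to UTF-8 only when the value needs it."""
    if charset is None and not value.isascii():
        charset = "utf-8"
    return encode_rfc2231(value, charset)

# Pure ASCII input needs no charset label.
print(encode_param_sketch("plain.txt"))

# Non-ASCII input is labelled with the charset used for the percent-escapes.
print(encode_param_sketch("na\u00efve.txt"))
```

Either way, the point is that the receiver can tell from the header which charset the escaped bytes are in.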

>> Should quote raise a TypeError if given a bytes with encoding/errors
>> arguments? (Motivation: TypeError is what you usually raise if you
>> supply too many args to a function).
> Resolved. Raises TypeError.
>
>> Lib/urllib/parse.py:
>> (As discussed above) Should quote accept safe characters outside the
>> ASCII range (thereby potentially producing invalid URIs)?
> Resolved? Implemented, but too messy and not worth it just to produce
> invalid URIs, so NOT in patch.

Agreed, safe should be ASCII chars only.
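
For illustration, here is how these two decisions look from the caller's side. This uses urllib.parse.quote as found in released Python 3, which is an assumption relative to patch 9; the details there may differ.

```python
from urllib.parse import quote

# Non-ASCII characters in 'safe' are ignored, so the e-acute is still escaped
# and the result contains only ASCII.
print(quote("caf\u00e9/x", safe="/\u00e9"))

# Bytes input is already encoded, so an encoding argument is rejected.
try:
    quote(b"caf\xc3\xa9", encoding="utf-8")
except TypeError as exc:
    print("TypeError:", exc)
```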

> That's only two very minor yes/no issues remaining. Please comment.

I believe patch 9 still has errors defaulting to strict for quote().
Weren't you going to change that?
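
To show what is at stake with the errors argument, here is a small example that passes it explicitly, so it does not depend on which default is chosen. It assumes urllib.parse.quote as found in released Python 3; the lone surrogate is just a convenient string that cannot be encoded as UTF-8.

```python
from urllib.parse import quote

lone_surrogate = "\udc80"  # not encodable as UTF-8

# With errors="strict" the encoding step raises.
try:
    quote(lone_surrogate, errors="strict")
except UnicodeEncodeError:
    print("strict: UnicodeEncodeError")

# With errors="replace" the character becomes "?", which is then escaped.
print(quote(lone_surrogate, errors="replace"))
```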

Regarding using UTF-8 as the default encoding, I still think this is the
right thing to do -- while the tables shown by Bill indicate that
there's still a lot of Latin-1 out there, UTF-8 is definitely gaining
on it, and I expect that Python apps, especially Py3k apps, are much
more likely to follow (and hopefully reinforce! :-) this trend than to
lag behind.
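
As a concrete illustration of the difference, the same character produces different escapes under the two encodings. The calls below assume the quote and unquote signatures from released Python 3's urllib.parse, with UTF-8 as the default.

```python
from urllib.parse import quote, unquote

e_acute = "\u00e9"

# UTF-8 (the default) uses two bytes for this character.
print(quote(e_acute))

# Latin-1 uses a single byte, so the escape is different.
print(quote(e_acute, encoding="latin-1"))

# Decoding must use the same encoding that produced the escapes.
print(unquote("%C3%A9") == e_acute)
print(unquote("%E9", encoding="latin-1") == e_acute)
```

Apps that still need to talk to Latin-1 servers can pass the encoding explicitly.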

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________