Bill Janssen <[EMAIL PROTECTED]> added the comment:

> Your original proposal was to make unquote() behave like
> unquote_to_bytes(), which would require changes to virtually every app
> using unquote(), since almost all apps assume the result is a (text)
> string.

Actually, careful apps realize that the result of "unquote" in Python 2 is a
sequence of bytes, and do something careful with that.  So only careless
apps would break, and they'd break in such a way that their maintainers
would have to look at the situation again, and think about it.  Seems like a
'good thing', to me.  And since this is Python 3, fully allowed.  I really
don't understand your position here, I'm afraid.

> Setting the default encoding to Latin-1 would prevent these errors,
> but would commit the sin of mojibake (the Japanese word for Perl code
> :-). I don't like that much either.

No, that would be wrong.  Returning a string just for the sake of returning
a string.  Remember, the data percent-encoded is not necessarily a string,
and not necessarily in any known encoding.

>
> A middle ground might be to set the default encoding to ASCII --
> that's closer to Martin's claim that URLs are supposed to be ASCII
> only.

URLs *are* supposed to be ASCII only -- but the percent-encoded byte
sequences in various parts of the path aren't.
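For concreteness, here is a minimal sketch of that distinction using the urllib.parse API as it eventually shipped in Python 3 (the `%E9` segment is a made-up example of Latin-1-encoded data, not valid UTF-8):

```python
from urllib.parse import unquote, unquote_to_bytes

# Hypothetical path segment: the Latin-1 encoding of "café",
# percent-encoded.  The byte 0xE9 is not valid UTF-8.
segment = "caf%E9"

raw = unquote_to_bytes(segment)   # b'caf\xe9' -- just bytes, no guessing
text = raw.decode("latin-1")      # 'café', if the app knows the charset

# unquote() as shipped defaults to UTF-8 with errors="replace", so the
# undecodable byte silently becomes U+FFFD instead of raising:
mojibake = unquote(segment)       # 'caf\ufffd'
```

The URL itself is pure ASCII throughout; only after unquoting do the bytes appear, and only the application can say what encoding (if any) they carry.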

> This will require many apps to be changed, but at least it forces the
> developers to think about which encoding to assume (perhaps there's
> one handy in the request headers if it's a web app) or about error
> handling or perhaps using unquote_to_bytes().

Yes, this is closer to my line of reasoning.
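A sketch of that line of reasoning, assuming the encoding/errors keyword arguments that unquote() eventually grew (the helper name and the idea of taking the charset from request headers are hypothetical):

```python
from urllib.parse import unquote

def unquote_segment(segment, charset="ascii"):
    # Hypothetical helper: the caller supplies the charset explicitly
    # (e.g. taken from a web request's headers) instead of trusting a
    # library-wide default.  errors="strict" turns a wrong guess into a
    # loud UnicodeDecodeError rather than silent mojibake.
    return unquote(segment, encoding=charset, errors="strict")

unquote_segment("hello%20world")              # 'hello world'
unquote_segment("caf%E9", charset="latin-1")  # 'café'
# unquote_segment("caf%E9") raises UnicodeDecodeError under the
# default ASCII assumption, forcing the developer to think about it.
```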

> However I fear that this middle ground will in practice cause:
>
> (a) more in-the-field failures, since devs are notorious for testing
> with ASCII only; and

Returning bytes deals with this problem.

> (b) the creation of a recipe for "fixing" unquote() calls that fail by
> setting the encoding to UTF-8 without thinking about the alternatives,
> thereby effectively recreating the UTF-8 default with much more pain.

Could be, but at least they will have had to think about it.  There's lots of
bad code out there, and maybe by making developers think about it, some of it
will improve.

> > A secondary concern is that it
> > will invisibly produce invalid data, because it decodes some
> > non-UTF-8-encoded string that happens to only use UTF-8-valid sequences
> > as the wrong string value.
>
> In my experience this is very unlikely. UTF-8 looks like total junk in
> Latin-1, so it's unlikely to occur naturally. If you see something
> that matches a UTF-8 sequence in Latin-1 text, it's most likely that
> in fact it was incorrectly decoded earlier...
Latin-1 isn't the only alternate encoding in the world, and not all
percent-encoded byte sequences in URLs are encoded strings.  I'd feel better
if we were being guided by more than just your experience (vast though it
may rightly be said to be!).  Say, by looking at all the URLs that Google
knows about :-).  I'd particularly feel better if some expert in Asian use
of the Web spoke up here...


_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________