On Mon, Jul 22, 2013 at 05:15:50PM +1000, Nick Coghlan wrote:
> On 07/22/2013 03:25 PM, Toshio Kuratomi wrote:
> 
> > If python3 could just finally fix outputting text with
> > surrogateescaped bytes then it would finally clean up the last
> > portion of this and I would be able to stop pointing out the
> > various ways that python3's unicode handling is just as broken as
> > python2's -- just in different ways. :-)
> 
> Attempting to encode data containing surrogate escapes without setting
> "errors=surrogateescape" is a sign that tainted data has escaped
> somewhere. So it's late notification of an error, but still indicative
> of an error somewhere. We'll never silence it by default.
> 
That's a simplification of python3's actual direction on this, unless
Victor Stinner's work is only intended to be temporary.

$ export LC_ALL=en_US.utf-8
$ mkdir abc$'\xff'
$ python3.3
>>> import os
>>> se_dirlisting = os.listdir('.')
>>> # surrogateescape in a text string:
>>> repr(se_dirlisting[0])
"'abc\\udcff'"
>>> # This doesn't traceback and it has to encode se_dirlisting when passing
>>> # it out of python:
>>> os.listdir(se_dirlisting[0])
[]
>>> # Works with other modules as well:
>>> import subprocess
>>> subprocess.call(['ls', se_dirlisting[0]])
0

AFAIK, the justification is that the surrogateescape'd strings are both
coming from and going to the OS.  They're crossing outside of the line that
python3 draws around itself and there's an implicit encoding and decoding
there.  This seems fine to me as a strategy.  The problems are just that
there are places where python3 doesn't yet use surrogateescape when crossing
this boundary.  The one I was specifically thinking of when I wrote this was
the print() function:

>>> print(se_dirlisting[0])
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 3: surrogates not allowed

(When I mentioned this at pycon you brought up:
http://bugs.python.org/issue15216 which looked promising but seems to have
stalled ;-)
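
Until print() handles this, one way around it is to encode explicitly with
surrogateescape and write the bytes to the underlying binary stream.  A
minimal sketch, assuming a utf-8 locale; write_escaped is a hypothetical
helper name, not something in the stdlib:

```python
import io


def write_escaped(text, stream, encoding='utf-8'):
    """Hypothetical helper: write text to a binary stream, turning any
    surrogateescaped code points back into the original raw bytes."""
    stream.write(text.encode(encoding, 'surrogateescape') + b'\n')


name = 'abc\udcff'  # what os.listdir() returned for the directory above

# A strict encode is what fails inside print() on a utf-8 stdout.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The explicit surrogateescape encode restores the original byte instead.
buf = io.BytesIO()
write_escaped(name, buf)
raw = buf.getvalue()
```

In a real program the stream would be sys.stdout.buffer rather than a
BytesIO; the BytesIO is only there so the result can be inspected.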

> > 
> >> * Inputs tainted with different assumptions? Immediate
> >> ValueError complaining about the taint mismatch
> >> 
> >> String encoding would be updated to trigger a ValueError when
> >> asked to encode tainted strings to an encoding other than the
> >> tainted one.
> >> 
> > 
> > I'm a little leery of these.  The reason is that after using both
> > python2 and the early versions of python3 I became a firm believer
> > that the problem with python2's unicode handling wasn't that it
> > threw exceptions, rather the problem was that the same bit of code
> > was too prone to passing through certain data without error and
> > throwing an error with other data. Programmers who tested their
> > code with only ascii data or only data encoded in their locale's
> > encoding, or only when their locale was a utf-8 encoding were
> > unable to replicate or understand the errors that their users got
> > when they ran them in the crazy real-world environments that users
> > inevitably have.  These rules that throw an Exception suffer from
> > the same reliance on the specific data and environment and will
> > lead to similar tracebacks that programmers won't be able to easily
> > replicate.
> 
> You can't get away from this problem: decoding with the wrong encoding
> corrupts your data.
>
Incorrect.  Decoding with the wrong encoding and then attempting to operate
on the decoded data in certain ways will corrupt your data.  Operating on it
in certain other ways will not.  surrogateescape was designed because people
realized that "round tripping" the data was a desirable feature.  Your
thoughts on tainting still allow for manipulating strings with other strings
that are tainted in the same manner.
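
To make the distinction concrete, here is a small sketch (the byte strings
and the file name are made up for illustration).  Concatenation and a plain
round trip preserve the original bytes; slicing wrongly decoded text is one
of the operations that does not:

```python
# Undecodable byte smuggled through a str with surrogateescape.
raw = b'abc\xff'
text = raw.decode('utf-8', 'surrogateescape')

# Combining with ordinary text and encoding back out keeps the byte intact.
joined = text + '/notes.txt'
round_tripped = joined.encode('utf-8', 'surrogateescape')

# UTF-8 data decoded with the wrong codec (latin-1): five characters, not four.
utf8_bytes = 'caf\u00e9'.encode('utf-8')
mis = utf8_bytes.decode('latin-1')

# Passing it through untouched is harmless...
untouched = mis.encode('latin-1')

# ...but slicing by character cuts the two-byte sequence in half.
truncated = mis[:4].encode('latin-1')
```

Here round_tripped and untouched match their inputs byte for byte, while
truncated ends in a lone lead byte that is no longer valid utf-8.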

> The only thing we can control are the consequences
> of that data corruption. The aim of Python 3 is that the *default*
> behaviour will be to produce an exception (hopefully with enough info
> to let you debug the situation), but provide various tools to let you
> say "I prefer the risk of silent output corruption, thanks."
> 
Uh no.  In the discussion about unicode handling in the python3.0 timeframe
on python-dev, I brought up this problem and suggested that we were learning
the wrong lesson from python2, i.e. at that time we were learning that
throwing an error was the wrong thing to do even though there was plainly
a problem between the data and the programmer's assumptions.  My theory
was that it wasn't the error that was a python2 wart, it was the fact that
we threw the error too late, too infrequently, and just in general, in a way
that made it harder for the programmer to debug the errors the user would
see on their real-world systems.  The response at that time was that errors
were, in fact, the wart that people wanted to remove.

Martin v. Löwis (MvL) then produced the surrogateescape handler (PEP 383),
which was supposed to address these things by allowing people to round trip
undecodable bytes into text strings and back out again.  This was good in
that we no longer *silently*
threw data away just because we couldn't decode it.  But it was bad as it
reintroduced the python2 problem of having a valid text string that portions
of the standard python3 framework (mostly stdlib functions) could not
handle.  This meant that the programmer could test print(os.listdir()[0]) on
their system where all filenames were utf-8 and things would work.  But
a user could then run the same code on a system where the locale did not
match with the encoding of all the filenames and would get a traceback.
This was a portion of the python2 wart resurfaced.

Victor Stinner's work since then to integrate surrogateescape into more
stdlib functions has helped immensely to craft a unified strategy for
undecodable bytes.  print() still throws a traceback but most other places
where we take in and send out undecodable bytes just work.

Now, I'm not saying that the idea of annotating surrogateescaped data with
information about the encoding it was created using is bad.  I am *leery*
(not yet convinced that the proposal is safe but also not yet convinced that
it is harmful... just knowing that there's a potential bad interaction
there) of throwing an exception as a result of examining that data as it
brings back a portion of the python2 wart of throwing an exception late in
the process rather than early, when the data is entering the system.  But,
just like surrogateescape itself, I can see that the idea is that the error
cases are supposed to shrink with this iteration -- at some point the hope
would be that the error conditions are small enough that the programmer no
longer has to care about them and perhaps this is that tipping point.

Some comments though -- if you're going to throw an error, don't throw
a ValueError.  It's immensely useful in python2 to get something
(UnicodeError) that is only thrown by an attempt to transform the data
wrongly for two reasons:  Lazy people can just catch UnicodeError and be
done with it.  People debugging issues can immediately see that the bug
falls into a (relatively) narrow space of issues.
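
As a rough sketch of what I mean (TaintMismatchError, encode_tainted and
safe_encode are all hypothetical names, and the taint is passed in by hand
here because no such machinery exists yet): since UnicodeError is itself
a subclass of ValueError, deriving from it costs nothing for people who
already catch ValueError.

```python
class TaintMismatchError(UnicodeError):
    """Hypothetical error for encoding tainted text to a different encoding."""


def encode_tainted(text, taint, target):
    # 'taint' stands in for the encoding a real implementation would have
    # recorded when the data was decoded with surrogateescape.
    if taint != target:
        raise TaintMismatchError(
            'text tainted as %r cannot be encoded as %r' % (taint, target))
    return text.encode(target, 'surrogateescape')


def safe_encode(text, taint, target):
    # The "lazy" pattern: one except clause covers the taint mismatch and
    # the existing UnicodeEncodeError / UnicodeDecodeError cases alike.
    try:
        return encode_tainted(text, taint, target)
    except UnicodeError:
        return None


mismatch = safe_encode('abc\udcff', 'utf-8', 'latin-1')
match = safe_encode('abc\udcff', 'utf-8', 'utf-8')
```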

If someone creates their own surrogateescaped string:
  se_str = 'abc\udcff'
the taint-checking machinery should allow that to be combined with strings
from any other sources.  It likely means that the programmer is working
around an API they don't control that takes text strings but should take
a bytes string.

-Toshio


_______________________________________________
python-devel mailing list
python-de...@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/python-devel
