On Mon, Jul 22, 2013 at 05:15:50PM +1000, Nick Coghlan wrote:
> On 07/22/2013 03:25 PM, Toshio Kuratomi wrote:
>
> > If python3 could just finally fix outputting text with
> > surrogateescaped bytes then it would finally clean up the last
> > portion of this and I would be able to stop pointing out the
> > various ways that python3's unicode handling is just as broken as
> > python2's -- just in different ways. :-)
>
> Attempting to encode data containing surrogate escapes without setting
> "errors=surrogateescape" is a sign that tainted data has escaped
> somewhere. So it's late notification of an error, but still indicative
> of an error somewhere. We'll never silence it by default.
>
That's a bit simplified compared to python3's actual direction on this,
unless Victor Stinner's work is only intended to be temporary.

    $ export LC_ALL=en_US.utf-8
    $ mkdir abc$'\xff'
    $ python3.3
    >>> import os
    >>> se_dirlisting = os.listdir('.')
    >>> # surrogateescape in a text string:
    >>> repr(se_dirlisting[0])
    "'abc\\udcff'"
    >>> # This doesn't traceback and it has to encode se_dirlisting when passing
    >>> # it out of python:
    >>> os.listdir(se_dirlisting[0])
    []
    >>> # Works with other modules as well:
    >>> import subprocess
    >>> subprocess.call(['ls', se_dirlisting[0]])
    0

AFAIK, the justification is that the surrogateescape'd strings are both
coming from and going to the OS. They're crossing outside of the line
that python3 draws around itself and there's an implicit encoding and
decoding there. This seems fine to me as a strategy. The problems are
just that there are places where python3 doesn't yet use surrogateescape
when crossing this boundary. The one I was specifically thinking of when
I wrote this was the print() function:

    >>> print(se_dirlisting[0])
    Traceback (most recent call last):
      File "<stdin>", line 3, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 3: surrogates not allowed

(When I mentioned this at pycon you brought up:
http://bugs.python.org/issue15216
which looked promising but seems to have stalled ;-)

> >
> >> * Inputs tainted with different assumptions? Immediate
> >> ValueError complaining about the taint mismatch
> >>
> >> String encoding would be updated to trigger a ValueError when
> >> asked to encode tainted strings to an encoding other than the
> >> tainted one.
> >>
> >
> > I'm a little leery of these. The reason is that after using both
> > python2 and the early versions of python3 I became a firm believer
> > that the problem with python2's unicode handling wasn't that it
> > threw exceptions, rather the problem was that the same bit of code
> > was too prone to passing through certain data without error and
> > throwing an error with other data.
> > Programmers who tested their
> > code with only ascii data or only data encoded in their locale's
> > encoding, or only when their locale was a utf-8 encoding were
> > unable to replicate or understand the errors that their users got
> > when they ran them in the crazy real-world environments that users
> > inevitably have. These rules that throw an Exception suffer from
> > the same reliance on the specific data and environment and will
> > lead to similar tracebacks that programmers won't be able to easily
> > replicate.
>
> You can't get away from this problem: decoding with the wrong encoding
> corrupts your data.
>
Incorrect. Decoding with the wrong encoding and then attempting to
operate on the decoded data in certain ways will corrupt your data.
Operating on it in certain other ways will not. surrogateescape was
designed because people realized that "round tripping" the data was a
desirable feature. Your thoughts on tainting still allow for
manipulating strings with other strings that are tainted in the same
manner.

> The only thing we can control are the consequences
> of that data corruption. The aim of Python 3 is that the *default*
> behaviour will be to produce an exception (hopefully with enough info
> to let you debug the situation), but provide various tools to let you
> say "I prefer the risk of silent output corruption, thanks."
>
Uh no. In the discussion about unicode handling in the python3.0
timeframe on python-dev, I brought up this problem and suggested that we
were learning the wrong lesson from python2, i.e., at that time we were
learning that throwing an error was the wrong thing to do even though
there was plainly a problem between the data and the programmer's
assumptions. My theory was that it wasn't the error that was a python2
wart, it was the fact that we threw the error too late, too
infrequently, and just in general, in a way that made it harder for the
programmer to debug the errors the user would see on their real-world
systems.
The response at that time was that errors were, in fact, the wart that
people wanted to remove.

MvL then produced the surrogateescape handler which was supposed to
address these things by allowing people to round trip undecodable bytes
into text strings and back out again. This was good in that we no longer
*silently* threw data away just because we couldn't decode it. But it
was bad as it reintroduced the python2 problem of having a valid text
string that portions of the standard python3 framework (mostly stdlib
functions) could not handle. This meant that the programmer could test
print(os.listdir()) on their system where all filenames were utf-8 and
things would work. But a user could then run the same code on a system
where the locale did not match the encoding of all the filenames and
would get a traceback. This was a portion of the python2 wart
resurfacing.

Victor Stinner's work since then to integrate surrogateescape into more
stdlib functions has helped immensely to craft a unified strategy for
undecodable bytes. print() still throws a traceback but most other
places where we take in and send out undecodable bytes just work.

Now, I'm not saying that the idea of annotating surrogateescaped data
with information about the encoding it was created using is bad. I am
*leery* (not yet convinced that the proposal is safe but also not yet
convinced that it is harmful... just knowing that there's a potential
bad interaction there) of throwing an exception as a result of examining
that data, as it brings back a portion of the python2 wart of throwing
an exception late in the process rather than early, when the data is
entering the system. But, just like surrogateescape itself, I can see
that the idea is that the error cases are supposed to shrink with this
iteration -- at some point the hope would be that the error conditions
are small enough that the programmer no longer has to care about them
and perhaps this is that tipping point.

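To make the round trip and the remaining failure concrete, here is a
minimal sketch. The byte string and the write_escaped helper are
illustrative assumptions, not anything taken from the stdlib or from the
taint proposal; only the codec calls themselves are real.

```python
import io

# Hypothetical undecodable filename bytes: 0xff is never valid in UTF-8.
raw = b"abc\xff"

# surrogateescape maps the bad byte to the lone surrogate U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# A strict encode (the codec default) refuses the lone surrogate.
try:
    text.encode("utf-8")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc


def write_escaped(s, stream):
    """Hypothetical helper: write s to a binary stream without raising."""
    stream.write(s.encode("utf-8", errors="surrogateescape") + b"\n")


# io.BytesIO stands in for a binary stream such as sys.stdout.buffer.
out = io.BytesIO()
write_escaped(text, out)
```

Passing sys.stdout.buffer as the stream would send the original bytes
to the terminal, which is one way to sidestep the print() traceback
until print() itself handles these strings.
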
Some comments though -- if you're going to throw an error, don't throw a
ValueError. It's immensely useful in python2 to get something
(UnicodeError) that is only thrown by an attempt to transform the data
wrongly for two reasons:

* Lazy people can just catch UnicodeError and be done with it.
* People debugging issues can immediately see that the bug falls into a
  (relatively) narrow space of issues.

If someone creates their own surrogateescaped string:

    se_str = 'abc\udcff'

the taint-checking machinery should allow that to be combined with
strings from any other sources. It likely means that the programmer is
working around an API they don't control that takes text strings but
should take a bytes string.

-Toshio
_______________________________________________
python-devel mailing list
python-de...@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/python-devel