[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Benjamin Peterson
Benjamin Peterson added the comment: I think it would be great to have a "Unicode/bytes" howto with information like this included.

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Nick Coghlan
Nick Coghlan added the comment: On 23 Aug 2013 01:40, "R. David Murray" wrote: (I double checked, and this does indeed work... doing the equivalent of ls >temp via python preserves the bytes with that PYTHONIOENCODING setting. I don't quite understand, however, why I get the � chars if I don'

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor
STINNER Victor added the comment:
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape')
b'\xff\xdcq\x00w\x00e\x00r\x00t\x00y\x00'
Oh, this is a bug in the UTF-16 encoder: it should not encode surrogate characters => see issue #12892
I read that it's possible to set a standard stream

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread R. David Murray
R. David Murray added the comment: If you pipe the ls (e.g. ls >temp) the bytes are preserved. Since setting the escape handler via PYTHONIOENCODING sets it for both stdin and stdout, it sounds like that solves the sysadmin use case. The sysadmin can just put that environment variable sett

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Nick Coghlan
Nick Coghlan added the comment: Note that the specific case I'm really interested in is printing on systems that are properly configured to use UTF-8, but are getting bad metadata from an OS API. I'm OK with the idea of *only* changing it for UTF-8 rather than for arbitrary encodings, as well as

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Serhiy Storchaka
Serhiy Storchaka added the comment:
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape')
b'\xff\xdcq\x00w\x00e\x00r\x00t\x00y\x00'
>>> ('\udcff' + 'qwerty').encode('utf-16le', 'surrogateescape').decode('utf-16le', 'surrogateescape')
'\udcff\udcdcqwerty'
>>> ('\udcff' + 'qwerty').e

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor
STINNER Victor added the comment:
> The surrogateescape error handler is dangerous with utf-16/32. It can produce globally invalid output.
I don't understand, can you give an example? surrogateescape generates invalid encoded strings with any encoding. Example with UTF-8:
>>> b"a\xffb".decode
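The point about UTF-8 can be illustrated with a small sketch. It is not taken from the truncated message above; the variable names and the `is_valid_utf8` helper are made up for illustration, and only the standard `bytes.decode` / `str.encode` calls are used.

```python
# An undecodable byte (0xff) is mapped to the lone surrogate U+DCFF on decode.
raw = b"a\xffb"
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly...
roundtrip = text.encode("utf-8", "surrogateescape")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes with the strict handler."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# ...so the encoder output is, by construction, not valid UTF-8.
output_is_valid = is_valid_utf8(roundtrip)
```

In other words, the handler preserves the bytes rather than repairing them, which is the trade-off being debated in this thread.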

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: The surrogateescape error handler is dangerous with utf-16/32. It can produce globally invalid output.

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor
STINNER Victor added the comment:
> Is it a bug in your patch, or is it deliberate?
It was not deliberate, and I think that it would be more consistent to use the same error handler (surrogateescape) when only the encoding is changed by the PYTHONIOENCODING environment variable. So surrogateesca

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Antoine Pitrou
Antoine Pitrou added the comment: Is it a bug in your patch, or is it deliberate?

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor
STINNER Victor added the comment:
>> Serhiy Storchaka also noticed (in the review of my patch) that errors is "strict" when PYTHONIOENCODING=utf-8 is used. We should also use surrogateescape if only the encoding is changed.
> I don't understand what you say. Could you rephrase?
With my pat

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Antoine Pitrou
Antoine Pitrou added the comment:
> Serhiy Storchaka also noticed (in the review of my patch) that errors is "strict" when PYTHONIOENCODING=utf-8 is used. We should also use surrogateescape if only the encoding is changed.
I don't understand what you say. Could you rephrase?

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Antoine Pitrou
Antoine Pitrou added the comment:
> See my message msg195769: Python3 cannot be simply used as a pipe because it wants to be kind by decoding binary data to Unicode, whereas not everybody cares about Unicode :-)
If somebody doesn't care about unicode, they can use sys.stdin.buffer. Problem solve
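As a rough sketch of the sys.stdin.buffer suggestion: a filter that never decodes its input cannot hit a Unicode error. The `copy_bytes` function and the in-memory demo streams below are hypothetical names chosen for illustration; in a real script the arguments would be `sys.stdin.buffer` and `sys.stdout.buffer`.

```python
import io


def copy_bytes(src, dst, chunk_size=65536):
    """Copy a binary stream to another without decoding; return bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# In-memory demo: the payload contains bytes that are not valid UTF-8.
demo_in = io.BytesIO(b"ok\n\xff\xfe not utf-8\n")
demo_out = io.BytesIO()
copied = copy_bytes(demo_in, demo_out)

# In a pipeline script the call would instead be:
#     copy_bytes(sys.stdin.buffer, sys.stdout.buffer)
```

This sidesteps the error handler question entirely, at the cost of working with bytes rather than str throughout the script.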

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor
STINNER Victor added the comment:
Serhiy Storchaka added the comment:
> Wouldn't it be safer to use surrogateescape for output and strict for input?
Nick wrote "Think sysadmins running scripts on Linux, writing to the console or a pipe." See my message msg195769: Python3 cannot be simply used as a p

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Wouldn't it be safer to use surrogateescape for output and strict for input?

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread STINNER Victor
STINNER Victor added the comment:
> I'm only saying that this will increase the number of cases where an exception is raised in an unexpected place.
The print() function is much more common than input(). IMO changing the error handler should fix more issues than it adds regressions. Python funct

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-22 Thread Serhiy Storchaka
Serhiy Storchaka added the comment:
> "The surrogateescape error handler works with any codec."
Ah, sorry. You are correct.
> Correct, but it's not something new: os.listdir(), sys.argv, os.environ and other functions using os.fsdecode(). Applications should already have to support surrogates.

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread STINNER Victor
STINNER Victor added the comment:
"The surrogateescape error handler works with any codec."
The surrogatepass handler only works with UTF-8, if I remember correctly. The surrogateescape error handler works with any codec, especially ASCII.
"As a side effect of this change an input from stdin will be inc
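A minimal sketch of the distinction drawn here, using ASCII for surrogateescape and UTF-8 for surrogatepass. The sample byte strings and variable names are illustrative assumptions, not taken from the thread.

```python
# surrogateescape with the ASCII codec: the non-ASCII byte 0xe9 is smuggled
# through as the lone surrogate U+DCE9 and restored on encode.
raw = b"caf\xe9"
text = raw.decode("ascii", "surrogateescape")
restored = text.encode("ascii", "surrogateescape")

# surrogatepass with UTF-8: a lone surrogate code point is encoded as if it
# were an ordinary character, and decoded back the same way.
lone = "\ud800"
passed = lone.encode("utf-8", "surrogatepass")
back = passed.decode("utf-8", "surrogatepass")
```

The two handlers solve different problems: surrogateescape tunnels undecodable bytes, while surrogatepass lets surrogate code points themselves be serialized.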

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: The surrogateescape error handler works only with UTF-8. As a side effect of this change an input from stdin will be incompatible in general with extensions which implicitly encode a string to bytes with UTF-8 (e.g. tkinter, XML parsers, sqlite3, datetime, l

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread STINNER Victor
STINNER Victor added the comment: Attached patch changes the error handler of stdin, stdout and stderr to surrogateescape by default. It can still be changed explicitly using the PYTHONIOENCODING environment variable. -- keywords: +patch Added file: http://bugs.python.org/file31414/surr
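The explicit override mentioned here can be tried out with a sketch like the following, which launches a child interpreter with PYTHONIOENCODING set to `encoding:errorhandler`. The `run_child` helper and the sample string are assumptions made for illustration; this shows the environment variable only, not the attached patch.

```python
import os
import subprocess
import sys


def run_child(io_encoding):
    """Run a child Python that prints a string containing a lone surrogate."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    return subprocess.run(
        # The raw string keeps the backslash escape for the child to interpret.
        [sys.executable, "-c", r"print('caf\udce9')"],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )


# With the strict handler, the child dies with UnicodeEncodeError.
strict = run_child("utf-8:strict")

# With surrogateescape, the escaped byte 0xe9 is written out unchanged.
escaped = run_child("utf-8:surrogateescape")
```

A sysadmin could get the same effect for a whole shell session by exporting the variable once, which is the workaround discussed elsewhere in the thread.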

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread Arfrever Frehtes Taifersar Arahesis
Changes by Arfrever Frehtes Taifersar Arahesis: -- nosy: +Arfrever

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread STINNER Victor
STINNER Victor added the comment: Currently, Python 3 fails miserably when it gets a non-ASCII character from stdin or when it tries to write a byte encoded as a Unicode surrogate to stdout. It works fine when OS data can be decoded from and encoded to the locale encoding. Example on Linux with

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-21 Thread R. David Murray
R. David Murray added the comment: I think the essential use case is using a python program in a unix pipeline. I'm very sympathetic to that use case, despite my unease.

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread Antoine Pitrou
Antoine Pitrou added the comment:
> After some thought, Nick came up with this solution. The idea is that surrogateescape was originally accepted to allow roundtripping data from the OS and back when the OS considers it to be a "string" but python does not consider it to be "text". When t

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread Nick Coghlan
Nick Coghlan added the comment: Think sysadmins running scripts on Linux, writing to the console or a pipe. I agree the generalisation is a bad idea, so only consider the original proposal that was specifically limited to the standard streams. Specifically, if a system is properly configured to

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread STINNER Victor
STINNER Victor added the comment:
2013/8/21 Nick Coghlan:
> Which reminds me: I'm curious what "ls" currently does for malformed filenames. The aim of this change would be to get 'python -c "import os; print(os.listdir())"' to do the best it can to work without losing data in such a situat

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread STINNER Victor
STINNER Victor added the comment: On Linux, the locale encoding is usually UTF-8. If a filename cannot be decoded from UTF-8, invalid bytes are escaped to the surrogate range using PEP 383. If I create a UTF-8 text file and I try to write the filename into this text file, the Python UTF-8 enc

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread Nick Coghlan
Nick Coghlan added the comment: Which reminds me: I'm curious what "ls" currently does for malformed filenames. The aim of this change would be to get 'python -c "import os; print(os.listdir())"' to do the best it can to work without losing data in such a situation.

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-20 Thread Toshio Kuratomi
Toshio Kuratomi added the comment: Nick and I had talked about this at a recent conference and came to it from different directions. On the one hand, Nick made the point that any encoding of surrogateescape'd text to bytes via a different encoding is corrupting the data as a whole. On the ot

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-12 Thread Nick Coghlan
Nick Coghlan added the comment: Everything about surrogateescape is dangerous - we're trying to work around the presence of bad data by at least allowing it to be tunnelled through Python code without corrupting it further :)

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-12 Thread R. David Murray
R. David Murray added the comment: My gut reaction to this is that it feels dangerous. That doesn't mean my gut is right, I'm just reporting my reaction :) -- nosy: +r.david.murray

[issue18713] Enable surrogateescape on stdin and stdout when appropriate

2013-08-12 Thread Nick Coghlan
New submission from Nick Coghlan: One problem with Unicode in 3.x is that surrogateescape isn't normally enabled on stdin and stdout. This means the following code will fail with UnicodeEncodeError in the presence of invalid filesystem metadata: print(os.listdir()) We don't really want to
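The failure mode described in this submission can be reproduced without touching the real filesystem or the real stdout. The sketch below assumes a UTF-8 locale and simulates it with an in-memory `io.TextIOWrapper`; the file name and the `emit` helper are hypothetical. Names are printed one at a time, because printing a list goes through repr(), which escapes lone surrogates instead of trying to encode them.

```python
import io

# What a filename with an invalid byte looks like after PEP 383 decoding
# under an assumed UTF-8 filesystem encoding.
bad_name = b"report-\xff.txt".decode("utf-8", "surrogateescape")


def emit(names, errors):
    """Print each name to an in-memory UTF-8 text stream; return the bytes."""
    buf = io.BytesIO()
    stream = io.TextIOWrapper(buf, encoding="utf-8", errors=errors, newline="\n")
    for name in names:
        print(name, file=stream)
    stream.flush()
    return buf.getvalue()


# A strict stream (the default for stdout at the time of this report) fails.
try:
    emit([bad_name], errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# A surrogateescape stream writes the original bytes back out.
escaped_output = emit([bad_name], errors="surrogateescape")
```

Under these assumptions the second call reproduces the original file name bytes, which is the behaviour the proposal wants for the standard streams.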