Nick Coghlan added the comment:

Note that the specific case I'm really interested in is printing on systems that are properly configured to use UTF-8, but are getting bad metadata from an OS API. I'm OK with the idea of *only* changing it for UTF-8 rather than for arbitrary encodings, as well as restricting it to sys.stdout when the codec used matches the default filesystem encoding.
To double check the current behaviour, I created a directory to tinker with this. Filenames were created with the following:

    >>> open("ℙƴ☂ℌøἤ".encode("utf-8"), "w")
    >>> open("basic_ascii".encode("utf-8"), "w")
    >>> b"\xd0\xd1\xd2\xd3".decode("latin-1")
    'ÐÑÒÓ'
    >>> open(b"\xd0\xd1\xd2\xd3", "w")

That last generates an invalid UTF-8 filename. "ls" actually degrades less gracefully than I thought, and just prints question marks for the bad file:

    $ ls -l
    total 0
    -rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:04 ????
    -rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 basic_ascii
    -rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 ℙƴ☂ℌøἤ

Python 2 & 3 both work OK if you just print the directory listing directly, since repr() happily displays the surrogate escaped string:

    $ python -c "import os; print(os.listdir('.'))"
    ['basic_ascii', '\xd0\xd1\xd2\xd3', '\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4']
    $ python3 -c "import os; print(os.listdir('.'))"
    ['basic_ascii', '\udcd0\udcd1\udcd2\udcd3', 'ℙƴ☂ℌøἤ']

Where it falls down is when you try to print the strings directly in Python 3:

    $ python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
    basic_ascii
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "<string>", line 1, in <listcomp>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd0' in position 0: surrogates not allowed

While setting the IO encoding produces behaviour closer to that of the native tools:

    $ PYTHONIOENCODING=utf-8:surrogateescape python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
    basic_ascii
    ����
    ℙƴ☂ℌøἤ

On the other hand, setting PYTHONIOENCODING as shown provides an environmental workaround, and http://bugs.python.org/issue15216 will provide an improved programmatic workaround (which tools like http://code.google.com/p/pyp/ could use to configure surrogateescape by default). So perhaps pursuing #15216 further would be a better approach than selectively changing the default behaviour?
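For reference, the difference between the two error handlers can be reproduced without touching the filesystem or the real sys.stdout. The following is only a rough sketch: the helper name write_name and the in-memory stream are illustrative assumptions, not part of any proposed patch.

```python
import io

# The invalid filename from the session above, as raw bytes.
raw = b"\xd0\xd1\xd2\xd3"

# Decoding with surrogateescape yields the lone surrogates that
# os.listdir() reported for this file on a UTF-8 system.
name = raw.decode("utf-8", errors="surrogateescape")


def write_name(text, errors):
    """Write text to an in-memory UTF-8 text stream (hypothetical helper).

    Returns the bytes that reached the underlying buffer.
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="utf-8", errors=errors)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


# A strict stream rejects the lone surrogates, as print() did above.
try:
    write_name(name, "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# A surrogateescape stream writes the original bytes back out unchanged.
round_tripped = write_name(name, "surrogateescape")
```

With errors="surrogateescape" the stream emits exactly the bytes the OS handed over, which is why the terminal then shows its own replacement glyphs for them.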
And better documentation for ways to handle the surrogate escape error when it arises?

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18713>
_______________________________________
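As one possible illustration of what such documentation might show, a caller can sanitise a surrogate escaped name before printing it, rather than changing the stream's error handler. This is a minimal sketch under the assumption that UTF-8 is the relevant encoding; printable_name is a hypothetical helper, not an existing API.

```python
def printable_name(name, encoding="utf-8"):
    """Return a copy of name that a strict encoder will accept.

    Hypothetical helper: surrogate escaped bytes are first restored to
    the original raw bytes, then decoded again with the "replace"
    handler so each undecodable byte becomes U+FFFD.
    """
    raw = name.encode(encoding, errors="surrogateescape")
    return raw.decode(encoding, errors="replace")


# The bad filename from the session above, as os.listdir() reported it.
bad = "\udcd0\udcd1\udcd2\udcd3"
safe = printable_name(bad)

# Well-formed names pass through untouched.
unchanged = printable_name("ℙƴ☂ℌøἤ")
```

The trade-off is that the result no longer round-trips to the original filename, so it is only suitable for display, not for passing back to OS APIs.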