Nick Coghlan added the comment:

Note that the specific case I'm really interested in is printing on systems 
that are properly configured to use UTF-8, but are getting bad metadata from 
an OS API. I'm OK with the idea of *only* changing it for UTF-8 rather than 
for arbitrary encodings, as well as restricting it to sys.stdout when the 
codec used matches the default filesystem encoding.

To double check the current behaviour, I created a directory to tinker with 
this. Filenames were created with the following:

>>> open("ℙƴ☂ℌøἤ".encode("utf-8"), "w")
>>> open("basic_ascii".encode("utf-8"), "w")
>>> b"\xd0\xd1\xd2\xd3".decode("latin-1")
'ÐÑÒÓ'
>>> open(b"\xd0\xd1\xd2\xd3", "w")

That last call creates a file whose name is not valid UTF-8. "ls" actually 
degrades less gracefully than I thought, and just prints question marks for 
the bad file:

$ ls -l
total 0
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:04 ????
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 basic_ascii
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 ℙƴ☂ℌøἤ

Python 2 & 3 both work OK if you just print the directory listing directly, 
since repr() happily displays the raw bytes (Python 2) or the surrogate 
escaped string (Python 3):

$ python -c "import os; print(os.listdir('.'))"
['basic_ascii', '\xd0\xd1\xd2\xd3', 
'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4']
$ python3 -c "import os; print(os.listdir('.'))"
['basic_ascii', '\udcd0\udcd1\udcd2\udcd3', 'ℙƴ☂ℌøἤ']

Where it falls down is when you try to print the strings directly in Python 3:

$ python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <listcomp>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd0' in position 
0: surrogates not allowed

Setting the IO encoding, by contrast, produces behaviour closer to that of 
the native tools:

$ PYTHONIOENCODING=utf-8:surrogateescape python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii
����
ℙƴ☂ℌøἤ
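
As a rough sketch, the same failure and workaround can be reproduced without 
touching the filesystem. The variable names are only for illustration, and 
the explicit decode stands in for what os.listdir() does on a UTF-8 system:

```python
# Bytes that are not valid UTF-8 (the bad filename from above).
raw = b"\xd0\xd1\xd2\xd3"

# Decoding with surrogateescape maps each undecodable byte to a lone
# surrogate in the range U+DC80..U+DCFF.
name = raw.decode("utf-8", "surrogateescape")

# The strict error handler (the default used for sys.stdout) refuses to
# encode lone surrogates, which is the UnicodeEncodeError shown above.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With surrogateescape on the way out, the original bytes come back intact.
round_tripped = name.encode("utf-8", "surrogateescape")
```

This is why the PYTHONIOENCODING setting helps: the bad bytes are passed 
through to the terminal unchanged, just as the native tools would emit them.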

On the other hand, setting PYTHONIOENCODING as shown provides an environmental 
workaround, and http://bugs.python.org/issue15216 will provide an improved 
programmatic workaround (which tools like http://code.google.com/p/pyp/ could 
use to configure surrogateescape by default).
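
Until something like that lands, a tool could approximate the programmatic 
workaround by wrapping the underlying binary stream itself. The following is 
only a sketch under that assumption: make_lenient_writer is a hypothetical 
helper name, and an in-memory buffer stands in for sys.stdout.buffer:

```python
import io


def make_lenient_writer(binary_stream, encoding="utf-8"):
    """Hypothetical helper: wrap a binary stream in a text layer that
    writes surrogate escaped characters back out as their original bytes."""
    return io.TextIOWrapper(
        binary_stream,
        encoding=encoding,
        errors="surrogateescape",
        newline="\n",          # no newline translation, for predictable output
        line_buffering=True,
    )


# An in-memory buffer stands in for sys.stdout.buffer in this sketch.
buf = io.BytesIO()
out = make_lenient_writer(buf)
print("\udcd0\udcd1\udcd2\udcd3", file=out)
out.flush()
written = buf.getvalue()
```

In a real tool the call would presumably be something like 
sys.stdout = make_lenient_writer(sys.stdout.buffer), done once at startup 
before anything is printed.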

So perhaps pursuing #15216 further would be a better approach than selectively 
changing the default behaviour? And better documentation for ways to handle the 
surrogate escape error when it arises?
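
One pattern such documentation might describe is converting a surrogate 
escaped name into something that is always safe to print, at the cost of 
losing the original bytes. A minimal sketch, with displayable as a 
hypothetical helper name and UTF-8 assumed as the encoding:

```python
def displayable(name, encoding="utf-8"):
    """Hypothetical helper: return a printable approximation of a name
    that may contain surrogate escaped bytes."""
    # Recover the original bytes, then decode again, replacing anything
    # undecodable with U+FFFD instead of a lone surrogate.
    raw = name.encode(encoding, "surrogateescape")
    return raw.decode(encoding, "replace")


shown = displayable("\udcd0\udcd1\udcd2\udcd3")
plain = displayable("basic_ascii")
```

The result is for display only. The original string should still be used for 
any further filesystem calls, since the replacement characters no longer 
identify the file.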

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18713>
_______________________________________