Eryk Sun <eryk...@gmail.com> added the comment:

> instead of the stated 'surrogatepass'
In Python 3.6 and above on Windows, you can check this as follows:

    >>> import sys
    >>> sys.getfilesystemencoding()
    'utf-8'
    >>> sys.getfilesystemencodeerrors()
    'surrogatepass'

In Python 3.5 and earlier:

    >>> sys.getfilesystemencoding()
    'mbcs'

In 3.5, the error handler used by fsencode() and fsdecode() was hard coded as 'strict' for the 'mbcs' encoding, and otherwise 'surrogateescape'.

> https://docs.python.org/3/library/os.html#os.fsencode
> https://docs.python.org/3/library/os.html#os.fsdecode

The above documentation needs to be updated to reference sys.getfilesystemencodeerrors(), as do the docstrings:

    >>> import os, textwrap
    >>> print(textwrap.dedent(os.fsencode.__doc__))

    Encode filename to the filesystem encoding with 'surrogateescape' error
    handler, return bytes unchanged. On Windows, use 'strict' error handler if
    the file system encoding is 'mbcs' (which is the default encoding).

    >>> print(textwrap.dedent(os.fsdecode.__doc__))

    Decode filename from the filesystem encoding with 'surrogateescape' error
    handler, return str unchanged. On Windows, use 'strict' error handler if
    the file system encoding is 'mbcs' (which is the default encoding).

> https://docs.python.org/3/library/os.html#file-names-command-line-arguments-and-environment-variables

This should be rewritten to link to sys.getfilesystemencodeerrors(). I'm fine with only discussing the use of "surrogateescape", which is a significant concern on POSIX systems, where it is easy and common for filenames to be created with an arbitrary encoding.

I don't know if the use of "surrogatepass" in Windows warrants discussion. The error handler is rarely needed because the filesystem is Unicode. A user is unlikely to create a filename with an unpaired surrogate code. That said, before Windows 10, the legacy console allowed copying half of a surrogate pair to the clipboard, and a program could have a bug that nulls the second surrogate code in the pair (e.g. when limiting the length of a filename). Anyway, it's technically possible, so we support it.
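To make the difference between the handlers discussed above concrete, here is a minimal sketch that applies each one to the UTF-8 codec directly, so it should behave the same on any platform. The sample strings and the describe_fs_codec() helper are hypothetical names chosen for illustration; they are not part of the os or sys modules.

```python
import sys

# A lone high surrogate, as could be left behind when a surrogate pair
# is cut in half. (Illustrative sample value.)
name = "devil\ud83d"

# 'strict' refuses to encode an unpaired surrogate as UTF-8.
try:
    name.encode("utf-8", "strict")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' encodes the surrogate code point itself, so the
# string survives a round trip through bytes.
passed = name.encode("utf-8", "surrogatepass")
round_trip_pass = passed.decode("utf-8", "surrogatepass")

# 'surrogateescape' goes the other way: bytes that are not valid UTF-8
# are decoded to lone surrogates in U+DC80..U+DCFF and restored when
# the string is encoded again.
raw = b"caf\xe9"  # Latin-1 bytes, invalid as UTF-8 (illustrative sample)
escaped = raw.decode("utf-8", "surrogateescape")
round_trip_escape = escaped.encode("utf-8", "surrogateescape")


def describe_fs_codec():
    """Return (encoding, errors) for the running interpreter.

    Hypothetical helper. The errors value is None before Python 3.6,
    where sys.getfilesystemencodeerrors() does not exist.
    """
    get_errors = getattr(sys, "getfilesystemencodeerrors", lambda: None)
    return sys.getfilesystemencoding(), get_errors()


print(strict_ok)
print(passed)
print(ascii(escaped))
print(describe_fs_codec())
```

On Python 3.6 and later, the pair returned by describe_fs_codec() should match what os.fsencode() and os.fsdecode() use, which is why the documentation could point readers at sys.getfilesystemencodeerrors() instead of naming a fixed handler.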
For example, "😈" (U+0001F608) is encoded in UTF-16 as the surrogate pair (U+D83D, U+DE08). A filename could end up with only the first of the two codes:

    >>> import os
    >>> open('devil\ud83d', 'w').close()
    >>> print(ascii(os.listdir('.')[0]))
    'devil\ud83d'

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43395>
_______________________________________