New submission from David Watson <bai...@users.sourceforge.net>:

The mbstowcs and mbrtowc functions, which are used for the initial conversion of command-line arguments on Unix, can return lone or paired surrogates (e.g. \udcff for \xed\xb3\xbf in non-strict UTF-8), and these surrogates are currently placed into sys.argv unescaped. This causes two kinds of problems: strings that cannot be re-encoded into bytes, and strings that could represent more than one byte sequence. The examples below use the following script in a UTF-8 locale on Linux:

    import sys
    print(repr(sys.argv[1]))
    print(repr(sys.argv[1].encode(sys.getfilesystemencoding(), "surrogateescape")))

Strings that cannot be re-encoded:

    $ ./python argtest.py $'\xed\xa0\x80'
    '\ud800'
    Traceback (most recent call last):
      File "argtest.py", line 6, in <module>
        print(repr(sys.argv[1].encode(sys.getfilesystemencoding(), "surrogateescape")))
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

    $ ./python argtest.py $'\xed\xb0\x80'
    '\udc00'
    Traceback (most recent call last):
      File "argtest.py", line 6, in <module>
        print(repr(sys.argv[1].encode(sys.getfilesystemencoding(), "surrogateescape")))
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed

Aliasing between non-decodable bytes and encoded lone surrogates:

    $ ./python argtest.py $'\xff'
    '\udcff'
    b'\xff'

    $ ./python argtest.py $'\xed\xb3\xbf'
    '\udcff'
    b'\xff'

Aliasing between the encoding of a non-BMP character and the encoding of its UTF-16 representation (on narrow Unicode builds):

    $ ./python argtest.py $'\xf0\x90\x80\x80'
    '\U00010000'
    b'\xf0\x90\x80\x80'

    $ ./python argtest.py $'\xed\xa0\x80\xed\xb0\x80'
    '\U00010000'
    b'\xf0\x90\x80\x80'

Attached is a patch to fix these problems by replacing any decoded characters in the range 0xd800...0xdfff with the surrogateescape encodings of their source bytes.
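
The idea behind the fix can be sketched in pure Python. This is not the attached patch (which works at the C level); it is only an illustration under some assumptions. The names decode_arg and roundtrips are hypothetical, and the non-strict behaviour of mbstowcs is emulated here with the "surrogatepass" error handler, since Python's own UTF-8 decoder rejects encoded surrogates.

```python
import re

# Three-byte UTF-8 sequences that encode a code point in U+D800..U+DFFF
# (lead byte ED, second byte A0..BF, third byte 80..BF).
_ENCODED_SURROGATE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def decode_arg(raw: bytes, escape_surrogates: bool) -> str:
    """Hypothetical helper: decode one command-line argument from UTF-8.

    escape_surrogates=False emulates the reported behaviour: an encoded
    surrogate is decoded to the surrogate code point itself.
    escape_surrogates=True emulates the proposed fix: each source byte of
    an encoded surrogate becomes its surrogateescape form (U+DC00 + byte).
    Undecodable bytes are surrogate-escaped in both modes.
    """
    parts = []
    pos = 0
    for match in _ENCODED_SURROGATE.finditer(raw):
        # Everything before the encoded surrogate is decoded normally.
        parts.append(raw[pos:match.start()].decode("utf-8", "surrogateescape"))
        if escape_surrogates:
            parts.append("".join(chr(0xDC00 + b) for b in match.group()))
        else:
            parts.append(match.group().decode("utf-8", "surrogatepass"))
        pos = match.end()
    parts.append(raw[pos:].decode("utf-8", "surrogateescape"))
    return "".join(parts)


def roundtrips(raw: bytes, escape_surrogates: bool) -> bool:
    """Return True if decoding and re-encoding gives back the same bytes."""
    try:
        text = decode_arg(raw, escape_surrogates)
        return text.encode("utf-8", "surrogateescape") == raw
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for raw in (b"\xff", b"\xed\xb3\xbf", b"\xed\xa0\x80"):
        print(raw, repr(decode_arg(raw, False)), repr(decode_arg(raw, True)))
```

With escaping enabled, every byte sequence from the examples above maps to a distinct string and re-encodes to the original bytes; without it, \xff and \xed\xb3\xbf collide on '\udcff' and \xed\xa0\x80 cannot be re-encoded at all.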

----------
files: escape-surrogates.diff
keywords: patch
messages: 88272
nosy: baikie
severity: normal
status: open
title: Encoded surrogate characters on command line not escaped in sys.argv
type: behavior
versions: Python 3.1, Python 3.2
Added file: http://bugs.python.org/file14054/escape-surrogates.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue6097>
_______________________________________