On 2017-01-21 10:50, Pete Forman wrote:
> Thanks for a very thorough reply, most useful. I'm going to pick you up
> on the above, though.
>
> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
> and UTF-32. The rules for UTF-8 were tightened up in Unicode 4 and RFC
> 3629 (2003). There is CESU-8 if you really need a naive encoding of
> UTF-16 to UTF-8-alike.
>
> py> low = '\uDC37'
>
> is only meaningful on narrow builds pre Python 3.3 where the user must
> do extra to correctly handle characters outside the BMP.

Hi Pete-

Lone surrogate characters have a standardized use in Python, not just in
narrow builds of Python <= 3.2. Unpaired low surrogate characters (U+DC80
through U+DCFF) are used to store any bytes that couldn't be decoded with a
given character encoding scheme, for use in OS/filesystem interfaces that
use arbitrary byte strings:

"""
Python 3.6.0 (default, Dec 23 2016, 08:25:24)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = 'héllo'
>>> b = s.encode('latin-1')
>>> b
b'h\xe9llo'
>>> from os import fsdecode, fsencode
>>> decoded = fsdecode(b)
>>> decoded
'h\udce9llo'
>>> fsencode(decoded)
b'h\xe9llo'
"""

This provides a mechanism for lossless round-trip decoding and encoding of
arbitrary byte strings which aren't valid under the user's locale. This is
absolutely necessary on POSIX systems, where filenames can contain any
sequence of bytes regardless of the user's locale, and is even necessary on
Windows, where filenames are stored as opaque not-quite-UCS-2 strings:

"""
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> from pathlib import Path
>>> import os
>>> os.chdir(Path('~/Desktop').expanduser())
>>> filename = '\udcf9'
>>> with open(filename, 'w'): pass
>>> os.listdir('.')
['desktop.ini', '\udcf9']
"""

MMR...

-- 
https://mail.python.org/mailman/listinfo/python-list
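
A minimal sketch of the mechanism underneath fsdecode/fsencode on POSIX, for
anyone who wants to reproduce the round trip without depending on their
locale. It assumes UTF-8 as the codec (chosen here explicitly, not taken from
the transcripts above) and uses the 'surrogateescape' error handler from
PEP 383; the variable names are illustrative only.

```python
# Bytes that are not valid UTF-8 (0xE9 is 'é' in Latin-1).
raw = b'h\xe9llo'

# 'surrogateescape' maps each undecodable byte 0xNN to the lone low
# surrogate U+DC00 + 0xNN, so 0xE9 becomes U+DCE9.
decoded = raw.decode('utf-8', errors='surrogateescape')
escaped = decoded[1]

# Encoding with the same handler turns the surrogate back into the byte.
roundtrip = decoded.encode('utf-8', errors='surrogateescape')

# Strict UTF-8 still refuses to encode a lone surrogate.
try:
    decoded.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(decoded), hex(ord(escaped)), roundtrip, strict_ok)
```

So the str never pretends to be well-formed Unicode text; the lone surrogate
is only a marker that carries the original byte through to the next call
into the OS.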