On Mon, Mar 9, 2015 at 5:34 PM, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> Chris Angelico wrote:
>
>> As to the notion of rejecting the construction of strings containing
>> these invalid codepoints, I'm not sure. Are there any languages out
>> there that have a Unicode string type that requires that all
>> codepoints be valid (no surrogates, no U+FFFE, etc)?
>
> U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66
> noncharacters in Unicode, and they are legal in strings.
>
> http://www.unicode.org/faq/private_use.html#nonchar8
>
> I think the only illegal code points are surrogates. Surrogates should only
> appear as bytes in UTF-16 byte-strings.

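
For reference, here is a quick sketch of that distinction as seen from
Python 3. The helper names (is_noncharacter, is_surrogate,
encodes_to_utf8) are made up for illustration and are not part of any
library. It assumes CPython 3, where str will hold any code point but
the strict UTF-8 codec refuses lone surrogates.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacters: U+FDD0..U+FDEF plus the
    last two code points of each of the 17 planes (U+xxFFFE, U+xxFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_surrogate(cp: int) -> bool:
    """True for the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def encodes_to_utf8(cp: int) -> bool:
    """Can a one-code-point string be encoded with the strict UTF-8 codec?"""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Walk the whole code space, U+0000..U+10FFFF.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
surrogates = [cp for cp in range(0x110000) if is_surrogate(cp)]

print(len(noncharacters))                                 # 32 + 2 * 17
print(all(encodes_to_utf8(cp) for cp in noncharacters))   # noncharacters encode fine
print(any(encodes_to_utf8(cp) for cp in surrogates))      # lone surrogates do not
```

Note that chr(0xD800) itself succeeds here; it is only the encode step
that complains, so this says nothing about whether constructing such a
string ought to be allowed.
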
U+FFFE would cause problems at the beginning of a UTF-16 stream, as it
could be mistaken for a BOM - that's why it's a noncharacter. But
sure, let's leave them out of the discussion. The question is whether
surrogates are legal or not.

ChrisA
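
P.S. A minimal sketch of the BOM confusion, assuming Python 3's
BOM-sniffing "utf-16" codec as the reader. The sample string is
arbitrary; the point is only that U+FFFE written big-endian produces
the same two bytes as a little-endian BOM.

```python
import codecs

# A big-endian UTF-16 stream, written without a BOM, whose first
# code point happens to be the noncharacter U+FFFE.
data = "\ufffeA".encode("utf-16-be")
print(data)                             # b'\xff\xfeA' preceded by a NUL: ff fe 00 41

# Those first two bytes are exactly the little-endian BOM.
print(data[:2] == codecs.BOM_UTF16_LE)  # True

# A decoder that sniffs the BOM takes the stream for little-endian,
# swallows the first two bytes, and reads 00 41 as U+4100 instead of "A".
decoded = data.decode("utf-16")
print(ascii(decoded))                   # '\u4100'
```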