Re: Newbie question about text encoding

Chris Angelico Sun, 08 Mar 2015 23:47:03 -0700

On Mon, Mar 9, 2015 at 5:34 PM, Steven D'Aprano
<[email protected]> wrote:
> Chris Angelico wrote:
>
>> As to the notion of rejecting the construction of strings containing
>> these invalid codepoints, I'm not sure. Are there any languages out
>> there that have a Unicode string type that requires that all
>> codepoints be valid (no surrogates, no U+FFFE, etc)?
>
> U+FFFE and U+FFFF are *noncharacters*, not invalid. There are a total of 66
> noncharacters in Unicode, and they are legal in strings.
>
> http://www.unicode.org/faq/private_use.html#nonchar8
>
> I think the only illegal code points are surrogates. Surrogates should only
> appear as bytes in UTF-16 byte-strings.


U+FFFE would cause problems at the beginning of a UTF-16 stream, as it
could be mistaken for a byte-swapped BOM (U+FEFF) - that's why it's a
noncharacter. But sure, let's leave the noncharacters out of the
discussion. The question is whether surrogates are legal or not.
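
For reference, here is a rough sketch of how CPython 3 treats the two
cases. The helper name can_encode_utf8 is made up for illustration, and
the behaviour shown is what I'd expect from the standard codecs, not a
statement about what other languages do or what Unicode requires.

```python
import codecs

# U+FFFE written as big-endian UTF-16 gives the bytes FF FE, which are
# exactly the bytes of a little-endian BOM (U+FEFF).
fffe_bytes = "\ufffe".encode("utf-16-be")
looks_like_le_bom = fffe_bytes == codecs.BOM_UTF16_LE


def can_encode_utf8(s):
    """Hypothetical helper: True if s can be encoded as strict UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Both of these str objects can be constructed without error.
noncharacter = "\uffff"    # a noncharacter
lone_surrogate = "\ud800"  # an unpaired high surrogate

print(looks_like_le_bom)
print(can_encode_utf8(noncharacter))
print(can_encode_utf8(lone_surrogate))
```

So Python lets you build a str holding a lone surrogate, but the strict
UTF-8 codec refuses to encode it, while the noncharacter goes through.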

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
