Steven D'Aprano <st...@pearwood.info>: > On Sun, 20 Mar 2016 03:12 am, Marko Rauhamaa wrote: >> Steven D'Aprano <st...@pearwood.info>: >>> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote: >>>> Yes, but UTF-16 produces 16-bit values that are outside Unicode. >>> >>> Show me. >>> >>> Before you answer, if your answer is "surrogate pairs", that is >>> incorrect. Surrogate pairs is how UTF-16 encodes astral characters. >> >> UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers. >> Thus, the output of UTF-16 is not Unicode. > > [...] > > If your point is that the data you get from running UTF-16 on a > sequence of code points is "not Unicode, but 2-byte words", then I > agree, but I'm not sure why you think that's significant.
I say the surrogate characters are not Unicode. You say they are because they are used to encode astral characters. I say that point is irrelevant. I'm saying the surrogate characters are not Unicode because you are not allowed to store or communicate them. They are a hole in the Unicode fabric. They could have—probably should have—specified a UTF-16 encoding for the surrogate characters as well. That would have left the Unicode range uninterrupted. Well, the silver lining is that Python gained a number of extra code points it was free to use for special purposes, although to be faithful to Unicode, Python should refuse to store them. Marko -- https://mail.python.org/mailman/listinfo/python-list