On Sun, 20 Mar 2016 05:20 pm, Rustom Mody wrote:

> On Sunday, March 20, 2016 at 10:32:07 AM UTC+5:30, Steven D'Aprano wrote:
>> On Sun, 20 Mar 2016 03:12 am, Marko Rauhamaa wrote:
>>
>> > Steven D'Aprano :
>> >
>> >> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
>> >>> Yes, but UTF-16 produces 16-bit values that are outside Unicode.
>> >>
>> >> Show me.
>> >>
>> >> Before you answer, if your answer is "surrogate pairs", that is
>> >> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.
>> >
>> > UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers.
>> > Thus, the output of UTF-16 is not Unicode.
>>
>> I'm not sure what point you think you are making.
>>
>> Unicode (the character set part of it) is a set of abstract 23-bit
>> numbers,
>
> 23? Or 21?

Oops, you're right, it's 21 bits.

> More pertinently if the number of bits signifies, whatever is the sense of
> the word 'abstract'?

The Unicode standard does not, as far as I am aware, care how you represent
code points in memory, only that there are 0x110000 of them, numbered from
U+0000 to U+10FFFF. That's what I mean by abstract.

The obvious implementation is to use 32-bit integers, where 0x00000000
represents code point U+0000, 0x00000001 represents U+0001, and so forth.
This is essentially equivalent to UTF-32, but it's not mandated or specified
by the Unicode standard. You could, if you chose, use something else.

On the other hand, I believe that the output of the UTF transformations is
explicitly described in terms of 8-bit bytes and 16- or 32-bit words. For
instance, the UTF-8 encoding of "A" has to be a single byte with value 0x41
(decimal 65). It isn't that this is the most obvious implementation, it's
that it can't be anything else and still be UTF-8.

--
Steven
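
P.S. For anyone who wants to poke at this in the interpreter, here is a
rough sketch of the distinction between abstract code points and the
concrete code units an encoding produces. It assumes Python 3, and the
helper name code_units is made up for this illustration; it is not part of
any library.

```python
# Code points are abstract integers in range(0x110000); an encoding form
# pins down the concrete bytes. `code_units` is a hypothetical helper.

def code_units(text, encoding, width):
    """Encode text, then split the bytes into big-endian units of `width` bytes."""
    data = text.encode(encoding)
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]

# The abstract code point of "A" is the number 0x41 ...
code_point = ord("A")

# ... and its UTF-8 form must be exactly one byte with that value.
utf8_bytes = "A".encode("utf-8")

# UTF-32 code units are simply the code points stored as 32-bit integers.
utf32_units = code_units("A\U0001F600", "utf-32-be", 4)

# UTF-16 encodes an astral character (above U+FFFF) as a surrogate pair.
utf16_units = code_units("\U0001F600", "utf-16-be", 2)

print(hex(code_point), utf8_bytes)
print([hex(u) for u in utf32_units])
print([hex(u) for u in utf16_units])
```

The explicit "-be" codec variants are used so that no byte order mark is
prepended, which keeps the code units easy to compare with the code points.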