On Sun, 20 Mar 2016 10:22 pm, Chris Angelico wrote:

> On Sun, Mar 20, 2016 at 10:06 PM, Steven D'Aprano <st...@pearwood.info>
> wrote:
>> The Unicode standard does not, as far as I am aware, care how you
>> represent code points in memory, only that there are 0x110000 of them,
>> numbered from U+0000 to U+10FFFF. That's what I mean by abstract. The
>> obvious implementation is to use 32-bit integers, where 0x00000000
>> represents code point U+0000, 0x00000001 represents U+0001, and so forth.
>> This is essentially equivalent to UTF-16, but it's not mandated or
>> specified by the Unicode standard, you could, if you choose, use
>> something else.
>
> (UTF-32)
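(A quick sketch at the Python 3 interactive prompt, assuming 3.5 or later
for bytes.hex(), makes the quoted point concrete: UTF-32 is just each code
point number written out as a 32-bit word.)

>>> ord('A')                        # the abstract code point number for 'A'
65
>>> 'A'.encode('utf-32-be').hex()   # UTF-32, big-endian: the same number as four bytes
'00000041'
>>> '\U0001F600'.encode('utf-32-be').hex()
'0001f600'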
D'oh! I mean, yes, well done, you have passed my little test to see if
anyone is paying attention. Have a gold star.

> The codepoints are not representable in *memory*; they are, by
> definition, representable in a field of integers.

They're not directly representable in memory because the definition of
code points is not given in terms of memory values. Hence they are
abstract values, numbered in a certain way, and given certain semantics.
In other words, there's nothing in the Unicode standard that says that
code point U+0020 has to be stored as a byte 0x20, or a word 0x0020. But
the standard does say that the code point U+0020 represents a space
character.

[...]

>> On the other hand, I believe that the output of the UTF transformations
>> is explicitly described in terms of 8-bit bytes and 16- or 32-bit words.
>> For instance, the UTF-8 encoding of "A" has to be a single byte with
>> value 0x41 (decimal 65). It isn't that this is the most obvious
>> implementation, it's that it can't be anything else and still be UTF-8.
>
> Exactly. Aside from the way UTF-16 and UTF-32 have LE and BE variants,

Blame the chip manufacturers for that. Actually, I think we can blame
Intel specifically, for reversing the normal layout of words in memory.

-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list
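(For the record, a short Python 3 session, again assuming 3.5+ for
bytes.hex(), illustrates both points above: UTF-8 pins down the exact byte
for "A", and the LE/BE variants of UTF-16 and UTF-32 differ only in byte
order.)

>>> 'A'.encode('utf-8')             # UTF-8 of "A" must be the single byte 0x41
b'A'
>>> 'A'.encode('utf-8').hex()
'41'
>>> 'A'.encode('utf-16-be').hex()   # big-endian: most significant byte first
'0041'
>>> 'A'.encode('utf-16-le').hex()   # little-endian (Intel order): same bytes, reversed
'4100'
>>> 'A'.encode('utf-32-le').hex()
'41000000'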