On Sun, Mar 20, 2016 at 10:06 PM, Steven D'Aprano <st...@pearwood.info> wrote:
> The Unicode standard does not, as far as I am aware, care how you
> represent code points in memory, only that there are 0x110000 of them,
> numbered from U+0000 to U+10FFFF. That's what I mean by abstract. The
> obvious implementation is to use 32-bit integers, where 0x00000000
> represents code point U+0000, 0x00000001 represents U+0001, and so
> forth. This is essentially equivalent to UTF-32, but it's not mandated
> or specified by the Unicode standard; you could, if you choose, use
> something else.

The code points are not defined in terms of *memory*; they are, by
definition, integers. If you choose to represent those integers as
little-endian 32-bit values, then yes, the layout in memory will look
like UTF-32LE, but that's because UTF-32LE is defined in this extremely
simple way. In fact, that's exactly how the layers work - Unicode
defines a mapping of characters to code points, and then each UTF
defines a mapping of code points to bytes.
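To see that concretely, here's a quick sketch in Python (the string is
just an arbitrary example): packing each code point as a little-endian
32-bit integer gives you, byte for byte, the UTF-32LE encoding.

    import struct

    s = "A\u00e9\U0001f600"  # code points U+0041, U+00E9, U+1F600
    packed = b"".join(struct.pack("<I", ord(c)) for c in s)
    assert packed == s.encode("utf-32-le")  # identical byte sequences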
> On the other hand, I believe that the output of the UTF transformations
> is explicitly described in terms of 8-bit bytes and 16- or 32-bit
> words. For instance, the UTF-8 encoding of "A" has to be a single byte
> with value 0x41 (decimal 65). It isn't that this is the most obvious
> implementation, it's that it can't be anything else and still be UTF-8.

Exactly. Aside from UTF-16 and UTF-32 having LE and BE variants, there
is only one bit pattern for any given combination of character sequence
and UTF (so if you pin it down to e.g. "UTF-16LE", there's exactly one).
This is no accident. Unlike some encodings, where there's one "most
obvious" way to encode things but also a number of other legal ways, the
UTFs can be compared for equality [1] using simple byte-for-byte
comparison.

This means you don't have to worry about someone sneaking a magic
character past your filter; if you're checking a UTF-8 stream for the
character U+003C LESS-THAN SIGN, the only byte value to look for is
0x3C - the sequence 0xC0 0xBC, despite mathematically encoding the
number 0x3C, is explicitly forbidden.

[1] Though not for ordering - locale-aware collation doesn't follow
code point order, and code point order won't always match byte order.
But equality is easy.

ChrisA
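PS. You can watch a conforming decoder enforce that rule - here with
Python's own UTF-8 codec (a quick demonstration; the exact error
message will vary between implementations):

    >>> b"\x3c".decode("utf-8")
    '<'
    >>> b"\xc0\xbc".decode("utf-8")
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte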
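PPS. And the footnote in action: Python compares str by code point and
bytes lexicographically, so you can see UTF-16LE byte order disagreeing
with code point order (the two characters are just an arbitrary pair
straddling the BMP boundary), while UTF-8 happens to preserve it:

    a, b = "\uff21", "\U00010000"  # U+FF21 < U+10000 as code points
    assert a < b                                          # code point order
    assert a.encode("utf-16-le") > b.encode("utf-16-le")  # byte order flips
    assert a.encode("utf-8") < b.encode("utf-8")          # UTF-8 preserves it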