On Thu, 29 Jul 2010 11:14:24 -0700, Ethan Furman wrote:

> Don't think of unicode as a byte stream. It's a bunch of numbers that
> map to a bunch of symbols.

Not only are Unicode strings a bunch of numbers ("code points", in
Unicode terminology), but the numbers are not necessarily all stored at
the same width.

The full Unicode system allows for 1,114,112 code points, far more than
will fit in two bytes. The Basic Multilingual Plane (BMP) includes the
first 2**16 (65536) of those, code points U+0000 through U+FFFF; there
are a further 16 supplementary planes of 2**16 code points each,
covering U+10000 through U+10FFFF.

As I understand it (and I welcome corrections), some implementations of
Unicode only support the BMP and use a fixed-width representation of
16-bit characters for efficiency reasons. Supporting the entire range of
code points would require either a fixed width of 21 bits (which would
then probably be padded to four bytes), or a more complex variable-width
representation.

It looks to me like Python uses a 16-bit implementation internally,
which leads to some rather unintuitive results for code points in the
supplementary planes:

>>> c = chr(2**18)
>>> c
'\U00040000'
>>> len(c)
2

-- 
Steven
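
P.S. For anyone wondering where a length of 2 can come from, here is a
rough sketch of the UTF-16 surrogate-pair arithmetic, which is what a
16-bit implementation has to fall back on for supplementary-plane code
points. The helper names (to_surrogate_pair, utf16_code_units) are my
own invention for illustration, not anything in the standard library,
and I am assuming Python 3 here. The sys.maxunicode check is there
because the result of len() depends on how the interpreter was built:
it reports 0xFFFF on a 16-bit ("narrow") build and 0x10FFFF otherwise.

```python
import struct
import sys


def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into two UTF-16 code units.

    Hypothetical helper, written only to show the arithmetic.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


if __name__ == "__main__":
    # 0xFFFF means a narrow build, 0x10FFFF means the full range.
    print(hex(sys.maxunicode))
    # The same code point as in the session above.
    print([hex(u) for u in to_surrogate_pair(2**18)])
    # The codec should agree with the hand-rolled arithmetic.
    print([hex(u) for u in utf16_code_units(chr(2**18))])
```

On a build where len(chr(2**18)) is 2, those two code units are what the
string actually holds; on a build with the full range, the string holds
the single code point and only the UTF-16 encoding splits it.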