Xah Lee <[EMAIL PROTECTED]> wrote:
> " It's very wasteful of space. In most texts, the majority of the
>code points are less than 127, or less than 255, so a lot of space is
>occupied by zero bytes. "
>
>Not true. In Asia, most chars has unicode number above 255. Considered
>globally, *possibly* today there are more computer files in Chinese
>than in all latin-alphabet based lang.

This doesn't hold water. There are many good reasons for preferring
UTF-16 over UTF-8, but unless you know you're only ever going to be
handling scripts from Unicode blocks above Arabic, it's reasonable to
assume that UTF-8 will be at least as compact.

Consider that transcoding a Chinese file from UTF-16 to UTF-8 will
probably increase its size by 50% (the CJK ideograph blocks encode to
3 bytes in UTF-8, against 2 in UTF-16), while transcoding a document in
a Western European language the other way can be expected to increase
its size by up to 100% (every single-byte character is doubled). You'd
have to be talking about double the volume of CJK data before switching
from UTF-8 to UTF-16 becomes even a break-even proposition space-wise.

(It's curious to note that the average word length in English is often
taken to be 6 letters. Similarly, in UTF-8-encoded Chinese the average
word length is 6 bytes...)

-- 
\S -- [EMAIL PROTECTED] -- http://www.chaos.org.uk/~sion/
   "Frankly I have no feelings towards penguins one way or the other"
        -- Arthur C. Clarke
   her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
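
P.S. The byte counts behind the 50% and 100% figures are easy to check.
What follows is only a rough sketch, assuming Python 3; the helper name
encoded_sizes and both sample strings are made up for illustration.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, not counting any BOM."""
    # 'utf-16-le' is used instead of 'utf-16' so that the 2-byte
    # byte-order mark is not included in the UTF-16 total.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical samples: seven ASCII letters and four CJK ideographs.
latin = "penguin"
cjk = "\u4e2d\u6587\u6587\u672c"

print(encoded_sizes(latin))  # (7, 14): UTF-16 is 100% larger
print(encoded_sizes(cjk))    # (12, 8): UTF-8 is 50% larger
```

Real documents mix scripts (markup, digits, spaces and punctuation are
all single-byte in UTF-8), so measured ratios will usually sit between
these two extremes.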
-- http://mail.python.org/mailman/listinfo/python-list