In article <51440235$0$29965$c3e8da3$54964...@news.astraweb.com>, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:

> UTF-32 is a *fixed width* storage mechanism where every code point takes
> exactly four bytes. Since the entire Unicode range will fit in four
> bytes, that ensures that every code point is covered, and there is no
> need to walk the string every time you perform an indexing operation. But
> it means that if you're one of the 99.9% of users who mostly use
> characters in the BMP, your strings take twice as much space as
> necessary. If you only use Latin1 or ASCII, your strings take four times
> as much space as necessary.

I suspect that eventually, UTF-32 will win out. I'm not sure when "eventually" is, but maybe sometime in the next 10-20 years.

When I was starting out, the computer industry had a variety of character encodings designed to take up less than 8 bits per character: Sixbit, RAD-50, BCD, and so on. Each of these added complexity and took away character-set richness, but saved a few bits. At the time, memory was so expensive and so precious that it was worth it.

Over the years, memory became cheaper, address spaces grew from 16 to 32 to 64 bits, and the pressure to use richer character sets kept increasing. So now we're at the point where people are (mostly) using Unicode, but are still arguing about which encoding to use because the "best" complexity/space tradeoff isn't obvious.

At some point in the future, memory will be so cheap, and so ubiquitous, that people will wonder why we neanderthals bothered worrying about trying to save 16 bits per character. Of course, by then we'll be migrating to Mongocode and arguing about UTF-64 :-)

--
http://mail.python.org/mailman/listinfo/python-list
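P.S. The space tradeoff being argued about is easy to see directly in Python; a quick sketch (using the "-le" codec variants so no BOM is prepended to the counts):

```python
# Compare how many bytes the same text occupies in three Unicode encodings.
# UTF-32 is always 4 bytes per code point; UTF-8 and UTF-16 vary.
for s in ("hello", "na\u00efve", "\u4f60\u597d"):
    u8 = len(s.encode("utf-8"))
    u16 = len(s.encode("utf-16-le"))
    u32 = len(s.encode("utf-32-le"))
    print(f"{s!r}: utf-8={u8}  utf-16={u16}  utf-32={u32}")

# 'hello': utf-8=5   utf-16=10  utf-32=20   (ASCII: UTF-32 is 4x UTF-8)
# 'naïve': utf-8=6   utf-16=10  utf-32=20   (Latin-1 range)
# '你好':  utf-8=6   utf-16=4   utf-32=8    (BMP: UTF-32 is 2x UTF-16)
```

For text that stays in ASCII, UTF-32 really does cost four times the bytes; for BMP text it costs twice what UTF-16 does, which is exactly the tradeoff in the quoted paragraph.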
> UTF-32 is a *fixed width* storage mechanism where every code point takes > exactly four bytes. Since the entire Unicode range will fit in four > bytes, that ensures that every code point is covered, and there is no > need to walk the string every time you perform an indexing operation. But > it means that if you're one of the 99.9% of users who mostly use > characters in the BMP, your strings take twice as much space as > necessary. If you only use Latin1 or ASCII, your strings take four times > as much space as necessary. I suspect that eventually, UTF-32 will win out. I'm not sure when "eventually" is, but maybe sometime in the next 10-20 years. When I was starting out, the computer industry had a variety of character encodings designed to take up less than 8 bits per character. Sixbit, Rad-50, BCD, and so on. Each of these added complexity and took away character set richness, but saved a few bits. At the time, memory was so expensive and so precious, it was worth it. Over the years, memory became cheaper, address spaces grew from 16 to 32 to 64 bits, and the pressure to use richer character sets kept increasing. So, now we're at the point where people are (mostly) using Unicode, but are still arguing about which encoding to use because the "best" complexity/space tradeoff isn't obvious. At some point in the future, memory will be so cheap, and so ubiquitous, that people will be wondering why us neanderthals bothered worrying about trying to save 16 bits per character. Of course, by then, we'll be migrating to Mongocode and arguing about UTF-64 :-) -- http://mail.python.org/mailman/listinfo/python-list