Billy Mays wrote:

> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem).

Shorter version: FUD.

Yes, having a rich and varied character set requires work. Yes, the Unicode standard itself, and any interface to it (including Python's), are imperfect (like anything created by fallible humans). But your post is a long and tedious list of FUD with not one bit of useful advice.

I'm not going to go through the whole post -- life is too short. But here are two especially egregious examples showing that you have some fundamental misapprehensions about what Unicode actually is:

> Python doesn't do Unicode exception handling correctly. (but I
> suspect that its a broader problem with languages) A good example of
> this is with UTF-8 where there are invalid code points ( such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).

and then later:

> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).

This is confused. Unicode text has code points; text which has been encoded is nothing but bytes, not code points. "UTF-8 code point" does not even mean anything. The zero width space has code point U+200B. The bytes you get depend on which encoding you want:

    >>> zws = u'\N{Zero Width Space}'
    >>> zws
    u'\u200b'
    >>> zws.encode('utf-8')
    '\xe2\x80\x8b'
    >>> zws.encode('utf-16')
    '\xff\xfe\x0b '

But regardless of which bytes it is encoded into, ZWS always has just a single code point: U+200B.

You say "A good example of this is with UTF-8 where there are invalid code points ( such as 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF" but I don't even understand why you think this is a problem with Unicode. 0xC0 is not a code point, it is a byte. Not all combinations of bytes are legal in all files. If you have byte 0xC0 in a file, it cannot be an ASCII file: there is no ASCII character represented by byte 0xC0, because hex 0xC0 = 192, which is larger than 127.
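
The distinction between code points and bytes can be checked directly. What follows is only a rough sketch, and it assumes Python 3 (where str holds code points and bytes holds encoded data) rather than the Python 2 session shown above. The helper name can_decode is made up for illustration; it is not part of any library.

```python
# Python 3 sketch: str holds code points, bytes holds encoded data.
zws = "\N{ZERO WIDTH SPACE}"

# The text itself is a single code point, whatever encoding is used later.
code_points = [hex(ord(ch)) for ch in zws]

# The same code point becomes different bytes under different encodings.
utf8_bytes = zws.encode("utf-8")
utf16_bytes = zws.encode("utf-16-le")  # little-endian, no byte order mark


def can_decode(data, encoding):
    """Hypothetical helper: True if `data` is legal in `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(code_points, utf8_bytes, utf16_bytes)

# A lone 0xC0 byte is illegal in some encodings and legal in others.
for name in ("ascii", "utf-8", "latin-1"):
    print(name, can_decode(b"\xc0", name))
```

Run as written, the byte 0xC0 is rejected by the ascii and utf-8 codecs but accepted by latin-1, where it is simply the character U+00C0. The byte is not "invalid" in itself; it is only invalid for particular encodings.
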

Likewise, if you have a 0xC0 byte in a file, it cannot be UTF-8. It is as simple as that. Trying to treat it as UTF-8 will give an error, just as trying to view an mp3 file as if it were a jpeg will give an error. Why you imagine this is a problem for Unicode is beyond me.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list