On Jul 19, 7:07 am, Billy Mays <no...@nohow.com> wrote:
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>
> > Billy Mays wrote:
>
> >> On 07/17/2011 03:47 AM, Xah Lee wrote:
> >>> 2011-07-16
>
> >> I gave it a shot. It doesn't do any of the Unicode delims, because
> >> let's face it, Unicode is for goobers.
>
> > Goobers... that would be one of those new-fangled slang terms that the young
> > kids today use to mean its opposite, like "bad", "wicked" and "sick",
> > correct?
>
> > I mention it only because some people might mistakenly interpret your words
> > as a childish and feeble insult against the 98% of the world who want or
> > need more than the 127 characters of ASCII, rather than understand you
> > meant it as a sign of the utmost respect for the richness and diversity of
> > human beings and their languages, cultures, maths and sciences.
>
> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem.
>
> As long as I have used Python (which I admit has only been 3 years)
> Unicode has never appeared to be implemented correctly. I'm probably
> repeating old arguments here, but whatever.
>
> Unicode is a mess. When someone says ASCII, you know that they can only
> mean characters 0-127. When someone says Unicode, do they mean real
> Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8?
> When using the 'u' datatype with the array module, the docs don't even
> tell you if it's 2 bytes wide or 4 bytes. Which is it? I'm sure that
> all of these can be figured out, but the problem is now I have to
> ask every one of these questions whenever I want to use strings.
>
> Secondly, Python doesn't do Unicode exception handling correctly (but I
> suspect that it's a broader problem with languages). A good example of
> this is with UTF-8 where there are invalid code points (such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).
>
> When embedding Python in a long running application where user input is
> received, it is very easy to make a mistake which brings down the whole
> program. If any user string isn't properly try/excepted, a user could
> craft a malformed string which a UTF-8 decoder would choke on. Using
> ASCII (or whatever 8 bit encoding) doesn't have these problems since all
> codepoints are valid.
>
> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of Unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).
> Any string can masquerade as any other string by placing a few of these
> in a string. Any word filters you might have are now defeated by some
> cheesy Unicode nonsense character. Can you just check for these
> characters and strip them out? Yes. Should you have to? I would say no.
>
> Does it get better? Of course! International character sets used for
> domain name encoding use yet a different scheme (Punycode). Are the
> following two domain names the same: tést.com , xn--tst-bma.com ? Who
> knows!
>
> I suppose I can gloss over the pains of using Unicode in C with every
> string needing to be an LPS since 0x00 is now a valid code point in
> UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
> strlen or concatenation operations.
>
> Can it get even better? Yep. We also now need to have a Byte Order
> Mark (BOM) to determine the endianness of our characters. Are they
> little endian or big endian? (or perhaps one of the two possible middle
> endian encodings?) Who knows? String processing with Unicode is
> unpleasant to say the least. I suppose that's what we get when
> things are designed by committee.
>
> But Hey! The great thing about standards is that there are so many to
> choose from.
>
> --
> Bill

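
A few of the points quoted above can be tried out directly in Python. What follows is only a minimal sketch, assuming Python 3 and nothing beyond the standard library. The helper names (safe_decode, strip_zero_width, same_domain) and the set of zero-width characters are made up for illustration and do not come from the quoted message. It shows one way to decode untrusted bytes without an unhandled exception, to strip the zero width space (U+200B, which is the byte sequence 0xE2 0x80 0x8B in UTF-8) before a word filter runs, and to compare the two domain names via the built-in "idna" codec.

```python
# Sketch only: helper names are hypothetical, Python 3 standard library assumed.

# Zero-width characters to drop before filtering
# (an illustrative set, not an exhaustive list).
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def safe_decode(raw: bytes) -> str:
    """Decode untrusted bytes as UTF-8 without raising UnicodeDecodeError."""
    # Invalid bytes such as 0xC0 become U+FFFD instead of an exception.
    return raw.decode("utf-8", errors="replace")


def strip_zero_width(text: str) -> str:
    """Remove zero-width characters so a word filter sees the plain word."""
    # Mapping a code point to None in str.translate deletes that character.
    return text.translate(ZERO_WIDTH)


def same_domain(a: str, b: str) -> bool:
    """Compare two domain names after converting both to their ASCII form."""
    # The "idna" codec can raise UnicodeError for malformed labels;
    # that case is deliberately not handled in this sketch.
    return a.lower().encode("idna") == b.lower().encode("idna")


if __name__ == "__main__":
    print(repr(safe_decode(b"abc\xc0def")))
    print(strip_zero_width("sp\u200bam") == "spam")
    print(same_domain("t\u00e9st.com", "xn--tst-bma.com"))
```

With the standard "idna" codec, tést.com and xn--tst-bma.com compare equal here, since the second is the Punycode form of the first. Whether errors="replace" is the right policy (as opposed to rejecting the input outright) depends on the application.
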
Thanks for writing that.

Every time I try to understand Unicode and remain stuck, I come to the conclusion that I must be an imbecile. Seeing others (probably more intelligent than yours truly) struggle with it too gives me some solace!

[And I am writing this from India, where there are dozens of languages and almost as many scripts, and everyone speaks and writes at least a couple of non-European ones.]

--
http://mail.python.org/mailman/listinfo/python-list