On Mon, Jul 18, 2011 at 7:07 PM, Billy Mays <no...@nohow.com> wrote:
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>>
>> Billy Mays wrote:
>>
>>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>>>
>>>> 2011-07-16
>>>
>>> I gave it a shot. It doesn't do any of the Unicode delims, because
>>> let's face it, Unicode is for goobers.
>>
>> Goobers... that would be one of those new-fangled slang terms that the
>> young kids today use to mean its opposite, like "bad", "wicked" and
>> "sick", correct?
>>
>> I mention it only because some people might mistakenly interpret your
>> words as a childish and feeble insult against the 98% of the world who
>> want or need more than the 127 characters of ASCII, rather than
>> understand you meant it as a sign of the utmost respect for the
>> richness and diversity of human beings and their languages, cultures,
>> maths and sciences.
>
> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem.
>
> As long as I have used Python (which I admit has only been 3 years),
> Unicode has never appeared to be implemented correctly. I'm probably
> repeating old arguments here, but whatever.
>
> Unicode is a mess. When someone says ASCII, you know they can only mean
> characters 0-127. When someone says Unicode, do they mean real Unicode
> (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8? When using
> the 'u' datatype with the array module, the docs don't even tell you
> whether it's 2 bytes wide or 4. Which is it? I'm sure all of these can
> be figured out, but the problem is that now I have to ask every one of
> these questions whenever I want to use strings.
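For what it's worth, that last question has a direct answer: the
interpreter will tell you how it was built. A minimal sketch (Python 2
of this thread's era; sys.maxunicode and the array module's 'u'
typecode are both in the standard library):

    import array
    import sys

    # "Narrow" builds store unicode internally as UCS-2, "wide" builds
    # as UCS-4; every CPython reports which one it is.
    print hex(sys.maxunicode)              # 0xffff (narrow) or 0x10ffff (wide)
    print array.array('u', u'x').itemsize  # 2 on a narrow build, 4 on a wide one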
It doesn't matter. When you use the unicode data type in Python, you get
to treat it as a sequence of characters, not a sequence of bytes. The
fact that it's stored internally as UCS-2 or UCS-4 is irrelevant.

> Secondly, Python doesn't do Unicode exception handling correctly (but I
> suspect that's a broader problem with languages). A good example of
> this is with UTF-8, where there are invalid code points (such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).

A Unicode code point is of the form U+XXXX. 0xC0 is not a Unicode code
point, it is a byte. It happens to be an invalid byte in the UTF-8
encoding (which is not Unicode; it's one way of representing Unicode as
bytes). The Unicode code point U+00C0 is perfectly valid: it's LATIN
CAPITAL LETTER A WITH GRAVE.

> When embedding Python in a long-running application where user input is
> received, it is very easy to make mistakes which bring down the whole
> program. If any user string isn't properly try/excepted, a user could
> craft a malformed string which a UTF-8 decoder would choke on. Using
> ASCII (or whatever 8-bit encoding) doesn't have these problems, since
> all code points are valid.

UTF-8 != Unicode. UTF-8 is one of several byte encodings capable of
representing every character in the Unicode spec, but it is not Unicode.
If you have a Unicode string, it is not a sequence of bytes, it is a
sequence of characters. If you want a sequence of bytes, use a byte
string. If you are attempting to interpret a sequence of bytes as a
sequence of text, you're doing it wrong. There's a reason we have both
text and binary modes for opening files: yes, there is a difference
between them. (A short sketch of the decode step is at the end of this
message.)

> Another (this must have been a good laugh amongst the UniDevs)
> 'feature' of Unicode is the zero width space (UTF-8 byte sequence
> 0xE2 0x80 0x8B). Any string can masquerade as any other string by
> placing a few of these in it. Any word filters you might have are now
> defeated by some cheesy Unicode nonsense character. Can you just check
> for these characters and strip them out? Yes. Should you have to? I
> would say no.
>
> Does it get better? Of course! International character sets used for
> domain name encoding use yet another scheme (Punycode). Are the
> following two domain names the same: tést.com, xn--tst-bma.com? Who
> knows!
>
> I suppose I can gloss over the pains of using Unicode in C, with every
> string needing to be an LPS since 0x00 is now a valid code point in
> UTF-8 (0x0000 for 2-byte Unicode), or suffer the O(n) lookup time to do
> strlen or concatenation operations.

That is using UTF-8 in C. Which, again, is not the same thing as
Unicode.

> Can it get even better? Yep. We also now need a Byte Order Mark (BOM)
> to determine the endianness of our characters. Are they little endian
> or big endian (or perhaps one of the two possible middle-endian
> orderings)? Who knows? String processing with Unicode is unpleasant, to
> say the least. I suppose that's what we get when things are designed by
> committee.

And that is UTF-16 and UTF-32. Again, those are byte encodings. They are
not Unicode. When you use a library capable of handling Unicode, you
never see those; you just have a string with characters in it.

> But Hey! The great thing about standards is that there are so many to
> choose from.
>
> --
> Bill
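To make the bytes/text boundary concrete, here is the decode step in
miniature: a minimal sketch in the Python 2 of this thread's era, with
made-up byte values.

    # Untrusted input: valid UTF-8 for the euro sign, then one bad byte.
    raw = '\xe2\x82\xac\xff'

    try:
        text = raw.decode('utf-8')             # strict: raises UnicodeDecodeError
    except UnicodeDecodeError:
        text = raw.decode('utf-8', 'replace')  # or substitute U+FFFD and carry on
    print repr(text)                           # u'\u20ac\ufffd'

Decoding is the one place where bytes become characters. Guard it once
at the boundary (or pick a codec error handler such as 'replace' or
'ignore'), and malformed input never reaches the rest of the program as
a unicode string.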