On Mon, Jul 18, 2011 at 7:07 PM, Billy Mays <no...@nohow.com> wrote:
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>>
>> Billy Mays wrote:
>>
>>> On 07/17/2011 03:47 AM, Xah Lee wrote:
>>>>
>>>> 2011-07-16
>>>
>>> I gave it a shot. It doesn't do any of the Unicode delims, because
>>> let's face it, Unicode is for goobers.
>>
>> Goobers... that would be one of those new-fangled slang terms that the
>> young kids today use to mean its opposite, like "bad", "wicked" and
>> "sick", correct?
>>
>> I mention it only because some people might mistakenly interpret your
>> words as a childish and feeble insult against the 98% of the world who
>> want or need more than the 127 characters of ASCII, rather than
>> understand you meant it as a sign of the utmost respect for the
>> richness and diversity of human beings and their languages, cultures,
>> maths and sciences.
>
> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem.
>
> As long as I have used Python (which I admit has only been 3 years),
> Unicode has never appeared to be implemented correctly. I'm probably
> repeating old arguments here, but whatever.
>
> Unicode is a mess. When someone says ASCII, you know they can only mean
> characters 0-127. When someone says Unicode, do they mean real Unicode
> (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8? When using
> the 'u' datatype with the array module, the docs don't even tell you
> whether it's 2 bytes wide or 4. Which is it? I'm sure all of these can
> be figured out, but the problem is that now I have to ask every one of
> these questions whenever I want to use strings.
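For what it's worth, that last question has a direct answer: the
interpreter will tell you how it was built. A minimal sketch (Python 2
of this thread's era; sys.maxunicode and the array module's 'u'
typecode are both in the standard library):

    import array
    import sys

    # "Narrow" builds store unicode internally as UCS-2, "wide" builds
    # as UCS-4; every CPython reports which one it is.
    print hex(sys.maxunicode)              # 0xffff (narrow) or 0x10ffff (wide)
    print array.array('u', u'x').itemsize  # 2 on a narrow build, 4 on a wide one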
It doesn't matter. When you use the unicode data type in Python, you get
to treat it as a sequence of characters, not a sequence of bytes. The
fact that it's stored internally as UCS-2 or UCS-4 is irrelevant.

> Secondly, Python doesn't do Unicode exception handling correctly (but I
> suspect that's a broader problem with languages). A good example of
> this is with UTF-8, where there are invalid code points (such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).

A Unicode code point is of the form U+XXXX. 0xC0 is not a Unicode code
point, it is a byte. It happens to be an invalid byte in the UTF-8
encoding (which is not Unicode; it's one way of representing Unicode as
bytes). The Unicode code point U+00C0 is perfectly valid: it's LATIN
CAPITAL LETTER A WITH GRAVE.

> When embedding Python in a long-running application where user input is
> received, it is very easy to make mistakes which bring down the whole
> program. If any user string isn't properly try/excepted, a user could
> craft a malformed string which a UTF-8 decoder would choke on. Using
> ASCII (or whatever 8-bit encoding) doesn't have these problems, since
> all code points are valid.

UTF-8 != Unicode. UTF-8 is one of several byte encodings capable of
representing every character in the Unicode spec, but it is not Unicode.
If you have a Unicode string, it is not a sequence of bytes, it is a
sequence of characters. If you want a sequence of bytes, use a byte
string. If you are attempting to interpret a sequence of bytes as a
sequence of text, you're doing it wrong. There's a reason we have both
text and binary modes for opening files: yes, there is a difference
between them. (A short sketch of the decode step is at the end of this
message.)

> Another (this must have been a good laugh amongst the UniDevs)
> 'feature' of Unicode is the zero width space (UTF-8 byte sequence
> 0xE2 0x80 0x8B). Any string can masquerade as any other string by
> placing a few of these in it. Any word filters you might have are now
> defeated by some cheesy Unicode nonsense character. Can you just check
> for these characters and strip them out? Yes. Should you have to? I
> would say no.
>
> Does it get better? Of course! International character sets used for
> domain name encoding use yet another scheme (Punycode). Are the
> following two domain names the same: tést.com, xn--tst-bma.com? Who
> knows!
>
> I suppose I can gloss over the pains of using Unicode in C, with every
> string needing to be an LPS since 0x00 is now a valid code point in
> UTF-8 (0x0000 for 2-byte Unicode), or suffer the O(n) lookup time to do
> strlen or concatenation operations.

That is using UTF-8 in C. Which, again, is not the same thing as
Unicode.

> Can it get even better? Yep. We also now need a Byte Order Mark (BOM)
> to determine the endianness of our characters. Are they little endian
> or big endian (or perhaps one of the two possible middle-endian
> orderings)? Who knows? String processing with Unicode is unpleasant, to
> say the least. I suppose that's what we get when things are designed by
> committee.

And that is UTF-16 and UTF-32. Again, those are byte encodings. They are
not Unicode. When you use a library capable of handling Unicode, you
never see those; you just have a string with characters in it.

> But Hey! The great thing about standards is that there are so many to
> choose from.
>
> --
> Bill
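To make the bytes/text boundary concrete, here is the decode step in
miniature: a minimal sketch in the Python 2 of this thread's era, with
made-up byte values.

    # Untrusted input: valid UTF-8 for the euro sign, then one bad byte.
    raw = '\xe2\x82\xac\xff'

    try:
        text = raw.decode('utf-8')             # strict: raises UnicodeDecodeError
    except UnicodeDecodeError:
        text = raw.decode('utf-8', 'replace')  # or substitute U+FFFD and carry on
    print repr(text)                           # u'\u20ac\ufffd'

Decoding is the one place where bytes become characters. Guard it once
at the boundary (or pick a codec error handler such as 'replace' or
'ignore'), and malformed input never reaches the rest of the program as
a unicode string.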