At the moment, I'm not particularly inclined to argue about Unicode. Short 
of Larry handing down an edict and invoking Rule #1, the following rules 
will be in effect:

1) All Unicode data perl does regular expressions against will be in 
Normalization Form C, except for...
2) Regexes tagged to run against a decomposed form will instead be run 
against data in Normalization Form D. (What the tag is at the perl level is 
up for grabs. I'd personally choose a D suffix)
3) Perl won't otherwise force any normalization on data already in Unicode 
format.
4) Data converted to Unicode (from ASCII, EBCDIC, one of the JIS encodings, 
or whatever) will be converted into NFC.
5) Any character-based call (ord, substr, whatever) will deal with whatever 
code points are at the location specified. If the string is LATIN SMALL 
LETTER A, COMBINING ACUTE ACCENT and someone does a substr($foo, 1, 1) on 
it, you get back the single character COMBINING ACUTE ACCENT, and an ord 
would return the value 769 (U+0301).
6) There will be a glyph boundary/non-glyph boundary pair of regex 
characters to match the word/non-word boundary ones we already have. (While 
I'd personally like \g and \G, that won't work as \G is already taken)
7) There will be a Unicode package shipped standard with nfc() and nfd() 
calls to put things in normalization form C and D, respectively.
8) We will provide an I/O filter to convert into some Unicode normalization 
form or other, as well as to convert to and from UTF8, 16, and 32. (Both 
big and little endian for UTF16 and 32, though perl internally will handle 
the 16 and 32 bit integers in whatever's native for the platform)
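
To make rules 1, 4, 5, 6 and 7 concrete, here is a rough sketch in Python 
rather than Perl, since the Perl-level interface isn't settled. The names 
nfc(), nfd() and glyph_starts() are hypothetical stand-ins built on Python's 
standard unicodedata module, and the glyph-boundary test is a simplification 
(it only treats a code point with a non-zero combining class as part of the 
preceding glyph), not a full grapheme-cluster algorithm.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return s in Normalization Form C (composed). Hypothetical helper."""
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    """Return s in Normalization Form D (decomposed). Hypothetical helper."""
    return unicodedata.normalize("NFD", s)


def glyph_starts(s: str) -> list:
    """Offsets of code points that begin a glyph (simplified assumption:
    any code point whose canonical combining class is 0)."""
    return [i for i, ch in enumerate(s) if unicodedata.combining(ch) == 0]


# Rule 5: LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT.
decomposed = "a\u0301"
second = decomposed[1:2]            # analogous to substr($foo, 1, 1)
print(unicodedata.name(second))     # COMBINING ACUTE ACCENT
print(ord(second))                  # 769, i.e. U+0301

# Rules 1, 4 and 7: composing and decomposing the same text.
composed = nfc(decomposed)          # single code point U+00E1
print(len(decomposed), len(composed))
print(nfd(composed) == decomposed)

# Rule 6: glyph boundaries fall before "a" and before "b", not before U+0301.
print(glyph_starts("a\u0301b"))
```

Note that slicing works on code points, exactly as rule 5 says: nothing stops 
the caller from pulling the accent away from its base character.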

All of this is completely independent of whether the Unicode data is in 
UTF8, UTF16, UTF32, Morse Code, or trinary.
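
A small sketch of what that independence means, again in Python as a 
stand-in: the same two code points survive a round trip through each 
encoding, and only the byte counts differ. The codec names are Python's, 
not anything proposed for perl.

```python
# The string from rule 5: LATIN SMALL LETTER A + COMBINING ACUTE ACCENT.
text = "a\u0301"
code_points = [ord(ch) for ch in text]

encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
byte_lengths = {}
round_trips_ok = {}

for name in encodings:
    raw = text.encode(name)              # bytes on the wire or on disk
    decoded = raw.decode(name)           # back to code points
    byte_lengths[name] = len(raw)
    round_trips_ok[name] = [ord(ch) for ch in decoded] == code_points

print(byte_lengths)
print(round_trips_ok)
```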

Yes, I realize that point 5 may result in someone getting a meaningless 
Unicode string. Too bad--it is *not* the place of a programming language to 
enforce validity on data. That's the programmer's job.

I also realize that the decomposition flag on regexes would mean that 
s/A/B/D would turn A ACUTE to B ACUTE, which is meaningless. See the 
previous paragraph.
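
The effect can be sketched with Python's re and unicodedata modules 
standing in for the proposed D flag (normalizing to NFD by hand before 
substituting is my assumption of what the flag would amount to):

```python
import re
import unicodedata

# LATIN CAPITAL LETTER A WITH ACUTE as a single composed code point.
a_acute = "\u00c1"

# Against NFC data the pattern "A" does not match the composed character.
against_nfc = re.sub("A", "B", unicodedata.normalize("NFC", a_acute))

# Against NFD data the base letter is exposed, so only it gets replaced
# and the combining accent is left attached to the new letter.
against_nfd = re.sub("A", "B", unicodedata.normalize("NFD", a_acute))

print(against_nfc == a_acute)
print([unicodedata.name(ch) for ch in against_nfd])
```

The second result is LATIN CAPITAL LETTER B followed by COMBINING ACUTE 
ACCENT, and since Unicode has no precomposed B with acute, renormalizing 
to NFC leaves it as two code points.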

To be really blunt, unless someone foresees the world coming to an end or 
becoming really annoying because of one of the above rules, I think that 
should put an end to discussions of "whether or not". Discussions of "how" 
are now in order.

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
