At the moment, I'm not particularly inclined to argue about Unicode. Short
of Larry handing down an edict and invoking Rule #1, the following rules
will be in effect:
1) All Unicode data perl does regular expressions against will be in
Normalization Form C, except for...
2) Regexes tagged to run against a decomposed form will instead be run
against data in Normalization Form D. (What the tag is at the perl level is
up for grabs. I'd personally choose a D suffix)
3) Perl won't otherwise force any normalization on data already in Unicode
format.
4) Data converted to Unicode (from ASCII, EBCDIC, one of the JIS encodings,
or whatever) will be converted into NFC.
5) Any character-based call (ord, substr, whatever) will deal with whatever
code-points are at the location specified. If the string is LATIN SMALL
LETTER A, COMBINING ACUTE ACCENT and someone does a substr($foo, 1, 1) on
it, you get back the single character COMBINING ACUTE ACCENT, and an ord
would return the value 769 (U+0301).
6) There will be a glyph boundary/non-glyph boundary pair of regex
characters to match the word/non-word boundary ones we already have. (While
I'd personally like \g and \G, that won't work as \G is already taken)
7) There will be a Unicode package shipped standard with nfc() and nfd()
calls to put things in normalization form C and D, respectively.
8) We will provide an I/O filter to convert into some Unicode normalization
form or other, as well as to convert to and from UTF8, 16, and 32. (Both
big and little endian for UTF16 and 32, though perl internally will handle
the 16 and 32 bit integers in whatever's native for the platform)
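
For anyone who wants to poke at what rules 1, 4, and 7 mean in practice,
here is a rough sketch. Since the perl package doesn't exist yet, it uses
Python's unicodedata module as a stand-in; the nfc() and nfd() wrappers
only borrow the names from rule 7 and are not a proposed implementation.

```python
import unicodedata

def nfc(s: str) -> str:
    """Return s in Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", s)

def nfd(s: str) -> str:
    """Return s in Normalization Form D (decomposed)."""
    return unicodedata.normalize("NFD", s)

composed = "\u00e1"      # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # LATIN SMALL LETTER A, COMBINING ACUTE ACCENT

# Rule 4, illustrated: data arriving from a legacy encoding (Latin-1 is
# used here purely as an example) is put into NFC on conversion.
from_latin1 = nfc(b"\xe1".decode("latin-1"))

print(nfd(composed) == decomposed)     # True
print(nfc(decomposed) == composed)     # True
print(len(composed), len(decomposed))  # 1 2
print(from_latin1 == composed)         # True
```

The two forms are canonically equivalent but differ in code-point count,
which is why the form a regex runs against (rules 1 and 2) matters.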
All of this is completely independent of whether the Unicode data is in
UTF8, UTF16, UTF32, Morse Code, or trinary.
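
To make the encoding independence concrete, here is a small sketch (in
Python, as a stand-in for whatever the perl I/O filter ends up looking
like) that pushes the same two code points through UTF-8 and both byte
orders of UTF-16 and UTF-32. The variable names are just for illustration.

```python
text = "a\u0301"  # LATIN SMALL LETTER A, COMBINING ACUTE ACCENT

encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]

# Encode the same code-point sequence in each form.
encoded = {name: text.encode(name) for name in encodings}

# Decoding each byte string gives back the identical code points.
decoded = {name: raw.decode(name) for name, raw in encoded.items()}

for name in encodings:
    code_points = [ord(ch) for ch in decoded[name]]
    print(name, len(encoded[name]), "bytes ->", code_points)
```

The byte counts differ (3, 4, 4, 8, and 8 here), but the code points that
come back are always 97 and 769, and those are what the rules above are
defined on.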
Yes, I realize that point 5 may result in someone getting a meaningless
Unicode string. Too bad--it is *not* the place of a programming language to
enforce validity on data. That's the programmer's job.
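
Here is what that looks like, sketched in Python since its string slicing
also works on code points. The starts_with_combining_mark() helper is
something made up for this example, the sort of check a programmer who
cares would have to write for themselves.

```python
import unicodedata

def starts_with_combining_mark(s: str) -> bool:
    """True if s begins with a combining mark that has no base character."""
    return bool(s) and unicodedata.combining(s[0]) != 0

foo = "a\u0301"   # LATIN SMALL LETTER A, COMBINING ACUTE ACCENT

# The equivalent of substr($foo, 1, 1): one code point, starting at offset 1.
piece = foo[1:2]

print(unicodedata.name(piece))            # COMBINING ACUTE ACCENT
print(ord(piece))                         # 769
print(starts_with_combining_mark(piece))  # True
print(starts_with_combining_mark(foo))    # False
```

The slice is perfectly well-defined at the code-point level. Whether a
lone accent is useful is up to whoever asked for it.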
I also realize that the decomposition flag on regexes would mean that
s/A/B/D would turn A ACUTE to B ACUTE, which is meaningless. See the
previous paragraph.
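
A sketch of that case, again in Python because the perl-level tag is still
up for grabs. Normalizing to NFD before substituting is only my
approximation of what a decomposition flag would do, not a definition of
it.

```python
import re
import unicodedata

a_acute = "\u00c1"   # LATIN CAPITAL LETTER A WITH ACUTE, already in NFC

# Default behaviour: the regex sees one composed code point, so a plain
# "A" does not match and the string comes back unchanged.
against_nfc = re.sub("A", "B", a_acute)

# Decomposed behaviour: the data is A + COMBINING ACUTE ACCENT, so the
# "A" matches and the accent is left attached to the replacement.
against_nfd = re.sub("A", "B", unicodedata.normalize("NFD", a_acute))

print(against_nfc == a_acute)                # True
print([hex(ord(c)) for c in against_nfd])    # ['0x42', '0x301']

# Unicode has no precomposed B WITH ACUTE, so recomposing changes nothing.
print(len(unicodedata.normalize("NFC", against_nfd)))  # 2
```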
To be really blunt, unless someone foresees the world coming to an end or
becoming really annoying because of one of the above rules, I think that
should put an end to discussions of "whether or not". Discussions of "how"
are now in order.
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk