On Monday 11 June 2001 04:54 pm, Dan Sugalski wrote:
> >Would it, or should it, be possible to tell m// to treat Katakana
> >characters as the same as hiragana characters, in much the same way as
> >m//i treats UPPERCASE the same as lowercase? Canonicalization won't get
> >you that.
>
> Yup, that's pretty much it in a nutshell. This may end up being a
> Japanese-only thing, in which case it may not be worth much effort, but it
> seems as useful in some cases as the case-insensitivity we do for other
> character sets.
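As a rough illustration of the kana matching being asked about, here is a minimal Python sketch (not Perl, and not a proposal for how m// would do it). It assumes that folding is done by shifting the Katakana block U+30A1..U+30F6 down by 0x60 onto the corresponding Hiragana; the names fold_kana and kana_search are made up for this example.

```python
import re

# Assumption: Katakana U+30A1..U+30F6 line up with Hiragana U+3041..U+3096,
# so subtracting 0x60 from the code point folds one onto the other.
KATAKANA_FIRST = 0x30A1
KATAKANA_LAST = 0x30F6
KANA_OFFSET = 0x60

def fold_kana(text):
    """Map every Katakana letter in the range to its Hiragana counterpart."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_FIRST <= ord(ch) <= KATAKANA_LAST
        else ch
        for ch in text
    )

def kana_search(pattern, text):
    """Hypothetical kana-insensitive search: fold both sides, then match."""
    return re.search(fold_kana(pattern), fold_kana(text))

# A Hiragana pattern finds Katakana text only after folding.
plain = re.search("かたかな", "これはカタカナです")
folded = kana_search("かたかな", "これはカタカナです")
```

Folding the pattern string as a whole is only safe here because kana are not regex metacharacters; a real implementation would presumably fold inside the regex engine instead.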
Not sure if this is even in the same ballpark, but Arabic has the tah/tah
marbuta duality.
I suppose a similar (and by similar, I mean completely different) concept
might be the tatwiil, which has a codepoint (0x0640), but isn't a letter at
all. (It's similar to kerning, but lengthens joined characters instead of
adjusting white-space. Did I say this already? Hmmm....)
From a data perspective, "0x062C" (ARABIC LETTER JEEM) ne "0x062C0x0640"
(ARABIC LETTER JEEM followed by ARABIC TATWEEL), but from a text
perspective, "0x062C" (ARABIC LETTER JEEM) eq "0x062C0x0640" (ARABIC LETTER
JEEM followed by ARABIC TATWEEL) eq "0x062C0x06400x0640...." (ARABIC
LETTER JEEM followed by any number of ARABIC TATWEELs.) Arabic also
has voweling or no voweling to worry about. :-(
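To make the data-versus-text distinction concrete, here is a small Python sketch (Python rather than Perl, purely for illustration). It assumes that "text" equality means comparing after dropping U+0640, and optionally after dropping the marks in U+064B..U+0652 as a stand-in for "voweling"; fold and text_eq are hypothetical names, and the choice of that mark range is my assumption.

```python
import re

TATWEEL = "\u0640"  # ARABIC TATWEEL
JEEM = "\u062C"     # ARABIC LETTER JEEM
# Assumption: treat U+064B (fathatan) through U+0652 (sukun) as "voweling".
VOWEL_MARKS = re.compile("[\u064B-\u0652]")

def fold(text, ignore_vowels=False):
    """Drop tatweels, and optionally the vowel marks, before comparing."""
    text = text.replace(TATWEEL, "")
    if ignore_vowels:
        text = VOWEL_MARKS.sub("", text)
    return text

def text_eq(a, b, ignore_vowels=False):
    """Hypothetical 'text' comparison: equal once both sides are folded."""
    return fold(a, ignore_vowels) == fold(b, ignore_vowels)

print(JEEM == JEEM + TATWEEL)              # data comparison: False
print(text_eq(JEEM, JEEM + TATWEEL * 3))   # text comparison: True
```

In this sketch the exact match is still available as a plain == on the unfolded strings, which is one possible answer to "how do you ask for everything?"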
This gets back to the previous discussion of where the line is drawn. (An
easier way of looking at it may be, how do you ask for everything? In my
example, if you ignore tatwiils, how do you go after an exact match?) I'm
guessing that each locale will add hooks for the things unique to that
language.
/i means case-insensitivity in English; Arabic may (want to) usurp it for
"ignore vowelings", for instance. (I'm guessing. I keep waiting for someone
to tell me to uskuut.)
/k may be ignored (or carped) for regular regular expressions, but may
signal the Japanese handler to handle the dual matching. And it may signal
something completely different to some other language.
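One way to picture the per-language hooks is a table from locale and flag to a folding function. The Python sketch below is only a guess at the shape: HANDLERS, locale_search, the locale codes, and the decision to warn on an unknown flag are all assumptions for illustration, not anything that exists in Perl.

```python
import re
import warnings

def fold_kana(text):
    """Assumed Japanese hook: shift Katakana U+30A1..U+30F6 onto Hiragana."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30A1" <= ch <= "\u30F6" else ch
        for ch in text
    )

def strip_vowel_marks(text):
    """Assumed Arabic hook: drop the marks U+064B..U+0652."""
    return re.sub("[\u064B-\u0652]", "", text)

# Hypothetical registry: each locale decides what a flag letter means.
HANDLERS = {
    "ja": {"k": fold_kana},
    "ar": {"i": strip_vowel_marks},
}

def locale_search(pattern, text, flags="", locale="en"):
    """Apply each flag's hook for the locale, then run an ordinary search."""
    for flag in flags:
        hook = HANDLERS.get(locale, {}).get(flag)
        if hook is None:
            # "Ignored (or carped)": here the sketch warns and carries on.
            warnings.warn(f"/{flag} has no meaning for locale {locale!r}")
            continue
        pattern, text = hook(pattern), hook(text)
    return re.search(pattern, text)
```

The same letter can then mean different things per locale, and a locale with no hooks falls through to the plain regex.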
The hooks don't have to be that complicated, or even exist, since it seems
that they are simply being done for the convenience of the user. (To
simplify regexes.)
--
Bryan C. Warnock
[EMAIL PROTECTED]