Re: More character matching bits

Bryan C. Warnock  Mon, 11 Jun 2001 14:51:27 -0700
On Monday 11 June 2001 04:54 pm, Dan Sugalski wrote:
> >Would it, or should it, be possible to tell m// to treat Katakana
> >characters as the same as hiragana characters, in much the same way as
> >m//i treats UPPERCASE the same as lowercase?  Canonicalization won't get
> >you that.
>
> Yup, that's pretty much it in a nutshell. This may end up being a
> Japanese-only thing, in which case it may not be worth much effort, but it
> seems as useful in some cases as the case-insensitivity we do for other
> character sets.

Not sure if this is even in the same ballpark, but Arabic has the tah/tah 
marbuta duality.

I suppose a similar (and by similar, I mean completely different) concept 
might be the tatwiil, which has a codepoint (0x0640), but isn't a letter at 
all.  (It's similar to kerning, but lengthens joined characters instead of 
adjusting white-space.  Did I say this already?  Hmmm....)   

From a data perspective, "0x062C" (ARABIC LETTER JEEM) ne "0x062C0x0640" 
(ARABIC LETTER JEEM followed by ARABIC TATWEEL), but from a text 
perspective, "0x062C" (ARABIC LETTER JEEM) eq "0x062C0x0640" (ARABIC LETTER 
JEEM followed by ARABIC TATWEEL) eq "0x062C0x06400x0640...." (ARABIC 
LETTER JEEM followed by any number of ARABIC TATWEELs).  Arabic also has 
voweling or no voweling to worry about.  :-(
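
The data-versus-text distinction can be sketched in a few lines of Python 
(used here only for illustration; `text_eq` is a hypothetical helper, not 
an existing API).  The sketch assumes that "text" equality just means 
equality once every U+0640 has been removed, which is a guess at the rule 
rather than a settled definition.

```python
JEEM = "\u062C"     # ARABIC LETTER JEEM
TATWEEL = "\u0640"  # ARABIC TATWEEL

def text_eq(a: str, b: str) -> bool:
    """Compare as text: drop every tatweel first (assumed rule)."""
    return a.replace(TATWEEL, "") == b.replace(TATWEEL, "")

# Data perspective: the code point sequences differ.
data_equal = (JEEM == JEEM + TATWEEL)

# Text perspective: any number of tatweels is ignored.
text_equal = text_eq(JEEM, JEEM + TATWEEL * 3)
```

Plain `==` stays available for the exact match, so ignoring tatweels 
would be something the caller opts into rather than the default.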

This gets back to the previous discussion of where the line is drawn.  (An 
easier way of looking at it may be: how do you ask for everything?  In my 
example, if you ignore tatwiils, how do you go after an exact match?)  I'm 
guessing that each locale will add hooks for the things unique to that 
language.

/i means case-insensitivity in English, but Arabic may (want to) usurp it 
for "ignore vowelings", for instance.  (I'm guessing.  I keep waiting for 
someone to tell me to uskuut.)
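
A rough Python sketch of what "ignore vowelings" might mean in practice 
(Python only for illustration; the function names are hypothetical).  It 
assumes "voweling" covers the combining marks U+064B through U+0652, which 
also takes in shadda and sukun; a real handler might draw that line 
elsewhere.

```python
import re

# Assumption: "voweling" = the marks U+064B..U+0652 (fathatan .. sukun).
VOWELING = re.compile("[\u064B-\u0652]")

def strip_voweling(s: str) -> str:
    """Remove the assumed voweling marks from a string."""
    return VOWELING.sub("", s)

def match_ignoring_voweling(pattern: str, text: str):
    """Literal search after stripping voweling from both sides."""
    return re.search(re.escape(strip_voweling(pattern)), strip_voweling(text))

# KAF, TEH, BEH, each followed by FATHA, and the same letters unvoweled.
voweled = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = "\u0643\u062A\u0628"

plain_hit = re.search(re.escape(bare), voweled)   # exact search
folded_hit = match_ignoring_voweling(bare, voweled)
```

Here the exact search finds nothing because a fatha sits between each 
pair of letters, while the folded search matches.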

/k may be ignored (or carped) for regular regular expressions, but may 
signal the Japanese handler to handle the dual matching.  And it may signal 
something completely different to some other language.
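
For the Japanese case, a minimal Python sketch of the dual matching 
(Python only for illustration; `search_k` is a hypothetical stand-in for a 
/k modifier, not anything that exists).  It relies on katakana U+30A1 
through U+30F6 sitting exactly 0x60 above the corresponding hiragana, and 
it leaves every other character, including the few katakana with no 
hiragana counterpart, untouched.

```python
import re

def fold_kana(s: str) -> str:
    """Map katakana U+30A1..U+30F6 onto hiragana by subtracting 0x60."""
    return "".join(
        chr(ord(c) - 0x60) if "\u30A1" <= c <= "\u30F6" else c
        for c in s
    )

def search_k(pattern: str, text: str):
    """Hypothetical stand-in for m//k: literal search with kana folded."""
    return re.search(re.escape(fold_kana(pattern)), fold_kana(text))

katakana = "\u30AB\u30BF\u30AB\u30CA"  # KA TA KA NA in katakana
hiragana = "\u304B\u305F\u304B\u306A"  # KA TA KA NA in hiragana

plain_hit = re.search(re.escape(hiragana), katakana)  # exact search
folded_hit = search_k(hiragana, katakana)
```

Folding both the pattern and the text is the same trick usually used to 
explain case-insensitive matching; it only handles literal patterns here.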

The hooks don't have to be that complicated, or even exist, since it seems 
that they are simply being done for the convenience of the user.  (To 
simplify regexes.)
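
One way the per-locale hooks might be laid out, as a Python sketch 
(Python only for illustration; the `HOOKS` table, the locale tags, and the 
flag letters are all hypothetical).  Each hook is just a folding function 
applied before matching, and a flag with no hook for the current locale is 
carped about and otherwise ignored.

```python
import warnings

# Hypothetical hook table: (locale, flag) -> folding function.
HOOKS = {
    ("en", "i"): str.casefold,
    # Assumed "ignore vowelings": drop marks U+064B..U+0652.
    ("ar", "i"): lambda s: "".join(
        c for c in s if not "\u064B" <= c <= "\u0652"
    ),
    # Assumed kana folding: katakana U+30A1..U+30F6 -> hiragana.
    ("ja", "k"): lambda s: "".join(
        chr(ord(c) - 0x60) if "\u30A1" <= c <= "\u30F6" else c for c in s
    ),
}

def fold(s: str, locale: str, flag: str) -> str:
    """Apply the locale's hook for a flag; carp and return s if none."""
    hook = HOOKS.get((locale, flag))
    if hook is None:
        warnings.warn(f"/{flag} has no meaning for locale {locale!r}")
        return s
    return hook(s)
```

For example, `fold("ABC", "en", "i")` gives `"abc"`, while 
`fold("abc", "en", "k")` warns and hands the string back unchanged.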

-- 
Bryan C. Warnock
[EMAIL PROTECTED]