Re: Unicode handling

Dan Sugalski Fri, 23 Mar 2001 11:52:34 -0800
At 11:52 AM 3/23/2001 -0800, Hong Zhang wrote:
> > >I recommend using a 'u' flag, which indicates that all operations are
> > >performed against Unicode graphemes/glyphs. By default, matching is
> > >performed on codepoints.
> >
> > U doesn't really signal "glyph" to me, but we are sort of limited in what
> > we have left. We still need a zero-width assertion for glyph boundary
> > within regexes themselves.
>
>The 'u' flag means "advanced Unicode feature(s)", which includes "always
>matching against glyph/grapheme, not codepoint". What it really means is
>up for discussion. I think we probably still need a "glyph" or "grapheme"
>boundary in some cases.

Fair enough. I think there are some cases where there's a base/combining
pair of codepoints that doesn't map to a single combined-character
codepoint. Not matching on a glyph boundary could make things really odd,
but I'd hate to have the checking code on by default, since that'd slow
down the common case, where a string in NFC won't have any such pairs.
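
To make the base/combining case concrete, here is a minimal sketch in
Python (not Perl 6, and not a proposed implementation). It assumes Python's
standard unicodedata module; the helper name is_glyph_boundary is
hypothetical, and the check is only an approximation, since it treats any
codepoint with a non-zero canonical combining class as continuing the
previous glyph.

```python
import unicodedata

# e + COMBINING ACUTE ACCENT has a precomposed form, so NFC yields one codepoint.
composed = unicodedata.normalize("NFC", "e\u0301")

# q + COMBINING TILDE has no precomposed form, so NFC leaves two codepoints.
uncomposed = unicodedata.normalize("NFC", "q\u0303")

def is_glyph_boundary(s, i):
    """Approximate test: position i is a glyph boundary unless s[i]
    is a combining mark (non-zero canonical combining class)."""
    if i <= 0 or i >= len(s):
        return True
    return unicodedata.combining(s[i]) == 0

print(len(composed), len(uncomposed))
print(is_glyph_boundary("q\u0303x", 1), is_glyph_boundary("q\u0303x", 2))
```

A codepoint-level match for "q" would succeed against the uncomposed string
at offset 0 and stop at offset 1, which this check reports as not being a
glyph boundary. That is the odd result a boundary assertion would guard
against.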

> > >We need the character equivalence construct, such as [[=a=]], which
> > >matches "a", "A ACUTE".
> >
> > Yeah, we really need a big list of these. PDD anyone?
>
>I don't think we need a big list here. The [[=a=]] construct is part of
>the POSIX 1003.2 regex syntax, as is [[.ch.]]. Perl 5 does not support
>either syntax. We can implement them in Perl 6.

That's a separate issue I think I'll dodge for right now.
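
For reference, one possible reading of the [[=a=]] construct quoted above
can be sketched in Python. This is an assumption, not the POSIX definition:
POSIX leaves equivalence classes to the locale, while this sketch treats two
characters as equivalent when their canonical decompositions (NFD) start
with the same codepoint. The names base_char and in_equivalence_class are
hypothetical.

```python
import unicodedata

def base_char(ch):
    """Return the first codepoint of ch's canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", ch)[0]

def in_equivalence_class(ch, base):
    """Hypothetical stand-in for [[=base=]]: same base letter, accents ignored.
    Case is not folded, so 'A' is not equivalent to 'a' here."""
    return base_char(ch) == base_char(base)

print(in_equivalence_class("\u00e1", "a"))  # a with acute vs. plain a
print(in_equivalence_class("b", "a"))
```

Under this reading no big list is needed, since the decomposition data
already ships with the Unicode character database.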

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk