On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> 1) All Unicode data perl does regular expressions against will be in
> Normalization Form C, except for...
> 2) Regexes tagged to run against a decomposed form will instead be run
> against data in Normalization Form D. (What the tag is at the perl level is
> up for grabs. I'd personally choose a D suffix)
> 3) Perl won't otherwise force any normalization on data already in Unicode
> format.
So if I understand that correctly, running a regexp against a scalar will
cause that scalar to become normalized in a defined way (C or D, depending
on the regexp).
> 5) Any character-based call (ord, substr, whatever) will deal with whatever
> code-points are at the location specified. If the string is LATIN SMALL
> LETTER A, COMBINING ACUTE ACCENT and someone does a substr($foo, 1, 1) on
> it, you get back the single character COMBINING ACUTE ACCENT, and an ord
> would return the value 769.
So if you do (ord, substr, whatever) on a scalar without knowing where it has
been, you have no idea whether you're working on normalised data or not.
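To make that concrete, here is a minimal sketch in present-day Perl 5. It
assumes the Unicode::Normalize module is available and uses its NFC function
as a stand-in; none of this is the proposed perl6 interface. The same text
gives different answers to length, substr and ord depending on which form it
happens to be in:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT (U+0301):
# two code points, i.e. the decomposed form.
my $decomposed = "a\x{301}";

# The same text in Normalization Form C: the single code point U+00E1.
my $composed = NFC($decomposed);

# Prints: NFD: length 2, ord(substr($s, 1, 1)) = 769
printf "NFD: length %d, ord(substr(\$s, 1, 1)) = %d\n",
    length($decomposed), ord(substr($decomposed, 1, 1));

# Prints: NFC: length 1, ord(substr($s, 0, 1)) = 225
printf "NFC: length %d, ord(substr(\$s, 0, 1)) = %d\n",
    length($composed), ord(substr($composed, 0, 1));
```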
And in fact the same scalar may become denormalised:
$bar = substr $foo, 3, 1;
&frob ($foo);
$baz = substr $foo, 3, 1;
[so $bar and $baz differ] if someone runs it against a regular expression
[in this case inside the subroutine &frob. Hmm, but currently you can
already make changes to parameters, as they are passed by reference]
$bar = substr $foo, 3, 1;
$foo =~ /foo/; # This is not read only in perl6
$baz = substr $foo, 3, 1;
But this is documented - if you want (ord, substr, whatever) on a string
to make sense, you must explicitly normalize it to the form you want
beforehand, and not use any of the documented-as-normalizing operators on it
without normalizing it again.
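A rough sketch of the effect, again in Perl 5 and assuming Unicode::Normalize.
The frob below is a hypothetical stand-in that normalizes its argument in
place through @_ aliasing, which is what a normalizing regexp would do to
$foo under the rules quoted above:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical stand-in for any operation that normalizes its argument:
# it rewrites the caller's scalar in place to Normalization Form C.
sub frob { $_[0] = NFC($_[0]) }

my $foo = "abe\x{301}z";        # decomposed: a b e U+0301 z
my $bar = substr $foo, 3, 1;    # COMBINING ACUTE ACCENT
frob($foo);                     # $foo is now a b U+00E9 z
my $baz = substr $foo, 3, 1;    # "z"

# Prints: before: U+0301, after: U+007A
printf "before: U+%04X, after: U+%04X\n", ord($bar), ord($baz);

# Normalizing explicitly before indexing restores a predictable answer.
my $nfd = NFD($foo);
# Prints: NFD index 3: U+0301
printf "NFD index 3: U+%04X\n", ord(substr $nfd, 3, 1);
```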
And by implication of the above (particularly rule 3), eq compares
codepoints, not normalized forms.
Hmm. So
$foo =~ /^$bar$/; # did I need to \Q \E this?
might be true at the same time as
$foo ne $bar
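For illustration, a small Perl 5 sketch of that situation, assuming
Unicode::Normalize. The explicit NFC calls play the part of the
normalization the regexp would apply under the proposal; eq here is Perl 5's
existing codepoint comparison:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $foo = "\x{e9}";      # LATIN SMALL LETTER E WITH ACUTE, precomposed
my $bar = "e\x{301}";    # e followed by COMBINING ACUTE ACCENT

# eq compares code points, so the two spellings differ.
my $codepoints_equal = ($foo eq $bar) ? 1 : 0;

# Normalize both sides first, as a normalizing regexp would,
# and the anchored match succeeds.
my $pattern = NFC($bar);
my $regexp_matches = (NFC($foo) =~ /^\Q$pattern\E$/) ? 1 : 0;

# Prints: eq: 0, normalized match: 1
print "eq: $codepoints_equal, normalized match: $regexp_matches\n";
```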
I'm in two minds about this. It feels like it would be hard to implement the
internals to make eq work on normalized forms without
either
1: causing it to not be read only, hence UTF8 in might not be UTF8 out
because it had been part of an eq
or
2: having to double buffer almost every scalar, with both the original UTF8
and a (cached copy) normalized form
but at this point I'll shut up as I expect I'm ignorant of an RFC on how this
works without hitting either of the above problems.
Nicholas Clark