Re: Unicode handling

Nicholas Clark Thu, 22 Mar 2001 15:23:33 -0800
On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> 1) All Unicode data perl does regular expressions against will be in 
> Normalization Form C, except for...
> 2) Regexes tagged to run against a decomposed form will instead be run 
> against data in Normalization Form D. (What the tag is at the perl level is 
> up for grabs. I'd personally choose a D suffix)
> 3) Perl won't otherwise force any normalization on data already in Unicode 
> format.

So if I understand that correctly, running a regexp against a scalar will
cause that scalar to become normalized in a defined way (C or D, depending
on the regexp).

> 5) Any character-based call (ord, substr, whatever) will deal with whatever 
> code-points are at the location specified. If the string is LATIN SMALL 
> LETTER A, COMBINING ACUTE ACCENT and someone does a substr($foo, 1, 1) on 
> it, you get back the single character COMBINING ACUTE ACCENT, and an ord 
> would return the value 769.

So if you do (ord, substr, whatever) on a scalar without knowing where it has
been, you have no idea whether you're working on normalised data or not.
And in fact the same scalar may become denormalised:

  $bar = substr $foo, 3, 1;
  &frob ($foo);
  $baz = substr $foo, 3, 1;

[so $bar and $baz differ] if someone runs it against a regular expression
[in this case inside the subroutine &frob. Hmm, but currently you can
make changes to parameters as they are pass-by-reference]

  $bar = substr $foo, 3, 1;
  $foo =~ /foo/;        # This is not read only in perl6
  $baz = substr $foo, 3, 1;

But this is documented - if you want (ord, substr, whatever) on a string
to make sense, you must explicitly normalize it to the form you want
beforehand, and not use any of the documented-as-normalizing operators on it
without normalizing it again.
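
To make that concrete, here is a rough sketch in today's perl5, assuming the
Unicode::Normalize module is available. It only illustrates how the answer
from substr and ord depends on which form the string is in; the variable
names $nfd and $nfc are mine, and nothing here is the proposed perl6
interface.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT (decomposed input)
my $foo = "a\x{301}";

# Normalize explicitly before doing any character-based work.
my $nfd = NFD($foo);    # stays as two code points
my $nfc = NFC($foo);    # composes to the single code point U+00E1

printf "NFD: length %d, ord at offset 1 is %d\n",
    length($nfd), ord(substr($nfd, 1, 1));
printf "NFC: length %d, ord at offset 0 is %d\n",
    length($nfc), ord(substr($nfc, 0, 1));
```

The same logical text gives a different length and different code points
depending on the form, which is why the normalization has to be redone after
anything that might have changed it.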

And by implication of the above (particularly rule 3), eq compares
codepoints, not normalized forms.
Hmm. So

 $foo =~ /^$bar$/;      # did I need to \Q \E this?

might be true at the same time as

 $foo ne $bar
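
A small perl5 sketch of the codepoint comparison half of that, again assuming
Unicode::Normalize. Note that perl5 regexps do not normalize their target, so
this only shows eq and ne; the normalizing match is the proposed behaviour and
is not reproduced here.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $foo = "\x{e1}";      # precomposed LATIN SMALL LETTER A WITH ACUTE
my $bar = "a\x{301}";    # a + COMBINING ACUTE ACCENT

# Codepoint-wise the two strings differ ...
print "raw:        ", ($foo ne $bar ? "ne" : "eq"), "\n";

# ... but they are equal once both are put into the same form.
print "normalized: ", (NFC($foo) eq NFC($bar) ? "eq" : "ne"), "\n";
```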

I'm in two minds about this. It feels like it would be hard to implement the
internals to make eq work on normalized forms without either

1: causing it to not be read only, hence UTF8 in might not be UTF8 out
   because it had been part of an eq

or

2: having to double buffer almost every scalar, with both the original UTF8
   and a (cached copy) normalized form

but at this point I'll shut up as I expect I'm ignorant of an RFC on how this
works without hitting either of the above problems.

Nicholas Clark