RE: Unicode handling

Garrett Goebel Fri, 23 Mar 2001 08:47:47 -0800
From: Nicholas Clark [mailto:[EMAIL PROTECTED]]
> 
> On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> > 1) All Unicode data perl does regular expressions against 
> >    will be in Normalization Form C, except for...
> > 2) Regexes tagged to run against a decomposed form will 
> >    instead be run against data in Normalization Form D.
> >   (What the tag is at the perl level is up for grabs. I'd
> >   personally choose a D suffix)
> > 3) Perl won't otherwise force any normalization on data 
> >    already in Unicode format.
> 
> So if I understand that correctly, running a regexp against a 
> scalar will cause that scalar to become normalized in a
> defined way (C or D, depending on regexp)

I'm not sure whether to read that as the scalar itself being normalized,
or whether the "data perl does regular expressions against" would be a
normalized copy of that scalar's value.

Wouldn't normalizing the scalar lose information? I don't know Unicode, but
surely someone must have a use for storing strings in both NFC and NFD. Is
it valid to intermix both forms? Isn't there a need to preserve the data in
its original encoding? I don't like the idea of the language losing
information without the programmer's permission.
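
To make the worry concrete, here is a minimal sketch of one case where
normalizing in place would throw information away. It uses Python's standard
unicodedata module purely for illustration (my assumption, nothing to do with
how Perl would expose this). ANGSTROM SIGN (U+212B) is canonically equivalent
to LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5), so once a string has been
through either form the original code point cannot be recovered.

```python
import unicodedata

# Illustration only: Python's unicodedata stands in for whatever Perl ends up doing.
original = "\u212B"  # ANGSTROM SIGN

nfc = unicodedata.normalize("NFC", original)  # composed form
nfd = unicodedata.normalize("NFD", original)  # decomposed form

print([hex(ord(c)) for c in nfc])  # ['0xc5']
print([hex(ord(c)) for c in nfd])  # ['0x41', '0x30a']

# Going back to NFC from NFD does not restore the original code point.
print(unicodedata.normalize("NFC", nfd) == original)  # False
```

If the language normalized the scalar itself rather than a copy, the U+212B
would silently be gone.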


> > 5) Any character-based call (ord, substr, whatever) will 
> >    deal with whatever code-points are at the location
> >    specified. If the string is LATIN SMALL LETTER A, 
> >    COMBINING ACUTE ACCENT and someone does a 
> >    substr($foo, 1, 1) on it, you get back the single
> >    character COMBINING ACUTE ACCENT, and an ord would
> >    return the value 769.
> 
> So if you do (ord, substr, whatever) on a scalar without 
> knowing where it has been, you have no idea whether you're
> working on normalised or not. And in fact the same scalar
> may become denormalised:
> 
>   $bar = substr $foo, 3, 1;
>   &frob ($foo);
>   $baz = substr $foo, 3, 1;

Hmm... if I put on my "everything is an object in Perl 6" blinders, wouldn't
that be:

$foo : utf8d = "timtowtdi"; 
$bar : utf8  = substr $foo, 3, 1;
$baz : char8 = substr($foo,0,3) . substr($bar,3,3) . "tdi";

o  $foo would be normalized to NFD
o  substr would know what $foo is and operate on it per NFD
o  $bar would be normalized to NFC.
o  $baz would work with byte characters indeterminately

i.e., substr, ord, length would DWIM based on what type of string it is.
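
The reason the type would matter can be sketched as follows: the same index
picks out different things depending on which form the string is in. This is
only an illustration using Python's unicodedata module (an assumption for the
sake of a runnable example, not a proposed Perl interface), with the
LATIN SMALL LETTER A, COMBINING ACUTE ACCENT string from point 5.

```python
import unicodedata

# Illustration only; the string is the one from point 5 above.
decomposed = "a\u0301"                               # a + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E1

print(len(decomposed), len(composed))  # 2 1

# Index 1 of the decomposed string is the combining accent itself.
print(ord(decomposed[1]))              # 769

# The same position in the composed string is past the end.
print(composed[1:2] == "")             # True
```

So a substr at a fixed offset before and after something renormalizes the
string can return different results, which is the case a typed string would
have to pin down.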


>  $foo =~ /^$bar$/;    # did I need to \Q \E this?
>
> might be true at the same time as
> 
>  $foo ne $bar

Have the match operate on a copy of $bar normalized to whatever form $foo is in.
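
A rough sketch of what I mean, in Python only because its unicodedata module
makes it runnable. The function name matches_whole and the explicit form
argument are hypothetical; in the scheme above the form would come from the
type of $foo rather than being passed in.

```python
import unicodedata

def matches_whole(foo, bar, form="NFC"):
    """Compare foo with a copy of bar normalized to `form` (hypothetical helper).

    Assumes foo is already in `form`. Neither argument is modified.
    """
    bar_copy = unicodedata.normalize(form, bar)
    return foo == bar_copy

foo = "caf\u00e9"   # composed (NFC)
bar = "cafe\u0301"  # same text, decomposed (NFD)

print(foo == bar)               # False: the code points differ
print(matches_whole(foo, bar))  # True: compared in foo's form
print(len(bar))                 # 5: bar itself is untouched
```

The copy is thrown away after the comparison, so nothing the programmer stored
is changed behind their back.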


> I'm in two minds about this. It feels like it would be hard 
> to implement the internals to make eq work on normalized
> forms without either
> 
> 1: causing it to not be read only, hence UTF8 in might not be UTF8 out
>    because it had been part of an eq
> 
> or
> 
> 2: having to double buffer almost every scalar, with both the
>    original UTF8 and a (cached copy) normalized form

I really don't want to see #1. Do my naive suggestions get around #2?
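
For comparison, this is roughly what I understand #2 to look like, sketched in
Python with its unicodedata module. The class BufferedString and its methods
are made up for illustration and are not anyone's proposed internals. The
original code points are never rewritten, and a normalized copy is only built
(and cached) the first time something asks for it.

```python
import unicodedata

class BufferedString:
    """Hypothetical scalar: keeps the original text plus cached normalized copies."""

    def __init__(self, value):
        self.value = value  # original code points, never rewritten
        self._cache = {}    # form name -> normalized copy

    def normalized(self, form="NFC"):
        # Build the normalized copy lazily, on first use only.
        if form not in self._cache:
            self._cache[form] = unicodedata.normalize(form, self.value)
        return self._cache[form]

    def same_text(self, other, form="NFC"):
        # Equality on normalized copies; both originals stay as they were.
        return self.normalized(form) == other.normalized(form)

foo = BufferedString("e\u0301")  # decomposed
bar = BufferedString("\u00e9")   # composed

print(foo.same_text(bar))        # True
print(foo.value == "e\u0301")    # True: UTF8 in is still UTF8 out
```

The cost is the extra buffer per scalar that has been compared, which is the
overhead #2 is worried about.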

Garrett