At 11:04 PM +0300 4/21/04, Jarkko Hietaniemi wrote:
> We need to address that, then. If we're doing
> unicode, we damn well need to do it right--å is
> å, regardless of whether it's composed or
> decomposed.

Agreed -- on some level. But if we want to implement Larry's :u0 (bytes) and :u1 (code points) levels, we also need to have the "more raw" comparisons available, somehow. (I do not remember whether Larry specified that :u2 would do some of the Unicode normalizations by default, and thus (de)compositions.)

We'll work that out when the Perl 6 compiler gets to that point. For Parrot, my preference (unless ICU makes it infeasible, which I doubt) is to keep everything decomposed. I hear rumor that way's preferred... :)
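
To make the composed/decomposed point concrete, here is a minimal sketch in Python (not Parrot or ICU code) using the standard unicodedata module. It assumes "keep everything decomposed" means canonical decomposition (NFD); the helper names nfd and canonically_equal are made up for illustration.

```python
import unicodedata

# Two spellings of the same character.
composed = "\u00e5"      # LATIN SMALL LETTER A WITH RING ABOVE, one code point
decomposed = "a\u030a"   # 'a' followed by COMBINING RING ABOVE, two code points


def nfd(s: str) -> str:
    """Return s in canonical decomposed form (assumed storage form)."""
    return unicodedata.normalize("NFD", s)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after decomposing both (hypothetical helper)."""
    return nfd(a) == nfd(b)


# A raw code point comparison sees two different strings...
raw_equal = composed == decomposed
# ...while comparing the decomposed forms treats them as the same text.
canon_equal = canonically_equal(composed, decomposed)
```

If strings are decomposed once when they enter the system, later comparisons can presumably skip the normalization step and compare the stored forms directly.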


> If people want low-level binary comparisons (and
> generally we *shouldn't* for most things) then
> they'll need to force the string to binary.

And I'm not certain whether "forcing to binary" is the right visual image or approach here. Maybe we need some sort of "pragma" support so that we can tweak the ":u level"? The default level could well be :u2, the highest we can do without picking some "language" rules.
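
As a rough illustration of what a selectable ":u level" could mean for comparison, here is a sketch in Python. Everything in it is an assumption: the Encoded type and equal_at_level function are hypothetical, levels 0 and 1 follow the bytes and code points reading given earlier in this thread, and level 2 is assumed to compare after canonical decomposition, which the thread leaves open.

```python
import unicodedata
from typing import NamedTuple


class Encoded(NamedTuple):
    """A string as stored: raw bytes plus the encoding they are in (hypothetical)."""
    raw: bytes
    encoding: str

    def text(self) -> str:
        # Decode to a sequence of code points.
        return self.raw.decode(self.encoding)


def equal_at_level(a: Encoded, b: Encoded, level: int) -> bool:
    """Compare a and b at an assumed ':u' level."""
    if level == 0:
        # Bytes: the stored representations must match exactly.
        return a.raw == b.raw
    if level == 1:
        # Code points: encodings may differ, code point sequences must match.
        return a.text() == b.text()
    if level == 2:
        # Assumed: canonical equivalence, so composed and decomposed match.
        return (unicodedata.normalize("NFD", a.text())
                == unicodedata.normalize("NFD", b.text()))
    raise ValueError(f"unsupported level: {level}")


latin1 = Encoded("\u00e5".encode("latin-1"), "latin-1")
utf8_composed = Encoded("\u00e5".encode("utf-8"), "utf-8")
utf8_decomposed = Encoded("a\u030a".encode("utf-8"), "utf-8")
```

Under these assumptions, latin1 and utf8_composed differ at level 0 but match at levels 1 and 2, while utf8_composed and utf8_decomposed match only at level 2. A pragma would then amount to choosing the default value of level for a lexical scope.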

I've got a Cunning Plan, oddly enough, though the margins of this e-mail are too small to contain it. As soon as I get it finished I'm going to pass it on to the list and to a few non-list folks who I know are deep into this stuff (Autrijus and Dan Kogai, if I can get in touch. I *really* wish I had someone handy who did mainly Korean text processing...) and we'll see where we go from there. I have no doubt it'll be... fun. Yeah, that's the word, fun!
--
Dan


--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
