Jeff Clites <[EMAIL PROTECTED]> wrote:
On Apr 10, 2004, at 1:12 AM, Leopold Toetsch wrote:
use German; print uc("i");
use Turkish; print uc("i");
Perfect example. The string "i" is the same in each case. What you've done is implicitly supplied a locale argument to the uc() operation--it's just a hidden form of:
uc(string, locale);
Ok. Now when the identical string "i" (but originating from different locale environments) goes through a sequence of string operations later, how do you track the locale down to the final C<uc> where it's needed?
e.g.
use German;  my $gi = "i";
use Turkish; my $ti = "i";

my $s = $gi x 10;
...
print uc($s);   # locale is what?
Where do you track the locale, if not in the string itself.
I think it's quite like file handles in perl5--there are 2 choices:
print OUT "foo"; # string is printed to file handle OUT print "foo"; # string is printed to currently selected file handle
compare with:
uc($s, $locale);   # string $s is uppercased using locale $locale
uc($s);            # string is uppercased using current effective locale
I presume that "use German" would be equivalent to "set the current locale to German".
So again, locale is an implicit (or explicit) parameter to certain string operations, but not to string creation.
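To make that concrete in today's Perl 5 (just a sketch: uc_in_locale() and the "tr_TR.UTF-8" locale name are my own inventions, and whether "i" really uppercases to a dotted capital I here depends on the platform's locale data):

use strict;
use warnings;
use locale;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: uppercase $s under an explicitly named locale,
# restoring the previous locale afterwards.
sub uc_in_locale {
    my ($s, $locale) = @_;
    my $old = setlocale(LC_CTYPE);
    setlocale(LC_CTYPE, $locale) or warn "locale '$locale' not available\n";
    my $up = uc $s;
    setlocale(LC_CTYPE, $old);
    return $up;
}

print uc_in_locale("i", "tr_TR.UTF-8"), "\n";   # explicit locale parameter
print uc("i"), "\n";                            # current effective locale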
But let's say we did it the way you were thinking, and made locale part of the string. Consider this:
$string = $gi.$ti;
Now what locale would $string be in? It would be quite confusing.
Another way to state my point there is that locale definitely comes into play when certain operations are performed, and if you wanted to find the relevant locale by attaching it to a string, then things get instantly confusing when you need to do some operation involving two strings with different locales attached. To use my analogy from above, that would be similar to having the "currently selected file handle" be an attribute of a string.
[[Side note: Although uppercase/lowercase/titlecase are locale-dependent, there's also the separate notion of case-folding, which is locale-independent, and in a Unicode world is the convenient thing to use if you are just trying to discard case differences between two strings.]]
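In Perl 5.16+ terms that's the built-in fc(); a quick sketch:

use v5.16;   # enables fc() and say()
say fc("Stra\N{LATIN SMALL LETTER SHARP S}e") eq fc("STRASSE") ? "same" : "different";   # same
say fc("i") eq fc("I") ? "same" : "different";                                           # same, no locale consulted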
Hmm? The point is that if you have a list of strings, for instance some in English, some in Greek, and some in Japanese, and you want to sort them, then you have to pick a sort ordering.
Ok. I want to uppercase the strings - no sorting (yet). I have an array of Vienna's kebab booths. Half of these have Turkish names (at least); the rest is a mixture of other languages. I'd like to uppercase this array of names. How do I do it?
You get to decide, for each, which locale to use for uppercasing, or you use the German locale (for instance) to uppercase them all. What you decide to do will depend on what your goal is--on what you are trying to achieve by uppercasing them.
If your goal (for instance) is to just case normalize so that you can look for duplicates in your list, then you can use case-folding and avoid the whole locale issue.
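A sketch of that approach (the booth names are invented for illustration):

use v5.16;
use utf8;                               # the literals below contain non-ASCII
binmode STDOUT, ':encoding(UTF-8)';

my @booths = ("Kebap Dürüm", "KEBAP DÜRÜM", "Schnitzelhaus");
my %seen;
my @unique = grep { !$seen{ fc $_ }++ } @booths;
say for @unique;                        # the two kebab entries collapse into one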
If you are having signs painted for the vendors, and you want the names all in uppercase (for style) but you want to make sure that you are uppercasing each name in an appropriate way for that vendor's national origin, then on a per-string basis you need to decide on a locale. That might be a pain, but you'd have had the same pain if you wanted to attach the locale at string-creation time--you would have had to specify the locale for each one separately then as well.

Now, let's say you had decided to concatenate the names into a single string, and uppercase that. What locale would you use? Once you start concatenating strings, the idea of attaching the locale to a string becomes unworkable.
[[Side note 2: There can really be 2 different meanings of "attach a locale to a string": (1) locale is fundamentally a property of a string, and (2) let's hang a locale off of a string, for convenience. I'm saying that (1) is conceptually wrong, and (2) breaks down when you start concatenating strings.]]
one = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}" two = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";
one eq two //false--they're different strings normalizeFormD(one) eq normalizeFormD(two) //true
Sure. But if I want to compare "letters": one eq two. I think this is the normal case a user of Unicode wants or expects. On the surface it doesn't matter if the internal representation is different.
Right, that's fine, and I'm saying that the different "levels" boil down to different behaviors of an equality operator, not to different types of string. Again it's quite analogous to things already in Perl5. For instance, we have "eq" v. "==", and the choice between string and numeric comparison is made by the caller, not automatically determined based on the contents of the variables involved. So taking your example strings, we'd have two possible approaches:
use level 1; one eq two; //false
use level 2; one eq two; //true
or instead, this approach could be taken:
one eq two;                  //false--this is a level-1 comparison
one linguisticallyEq two;    //true--this is a level-2 comparison
one caseInsensitiveEq two;   //a different sort of "semantic" comparison
But with either approach, it boils down to deciding what comparison algorithm to use, and the decision is based on something other than the contents of the strings.
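Concretely, taking your example strings, a level-2 comparison can be written today as normalize-then-compare (a Perl 5 sketch using the core Unicode::Normalize module):

use v5.16;
use Unicode::Normalize qw(NFD);

my $one = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}";
my $two = "\N{LATIN CAPITAL LETTER A WITH ACUTE}";

say $one eq $two           ? "eq" : "ne";   # ne -- different code point sequences
say NFD($one) eq NFD($two) ? "eq" : "ne";   # eq -- equal once both are normalized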
OTOH normalizing all strings on input is not possible - what if they should go into a file in unnormalized form?
Sure, of course--especially because there are at least 4 common normalization forms, and you are 100% correct that which one to apply (if any) is a decision that would be made by the programmer on a per-string basis, depending on what they are trying to do.
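(For reference, the four forms side by side -- a sketch that just counts code points in each result:)

use v5.16;
use Unicode::Normalize qw(NFC NFD NFKC NFKD);

my $s = "\N{LATIN SMALL LETTER E WITH ACUTE}\N{LATIN SMALL LIGATURE FI}";   # e-acute plus the "fi" ligature
printf "%-4s -> %d code points\n", $_->[0], length $_->[1]->($s)
    for [ NFC => \&NFC ], [ NFD => \&NFD ], [ NFKC => \&NFKC ], [ NFKD => \&NFKD ];
# NFC -> 2, NFD -> 3, NFKC -> 3, NFKD -> 4: same text, four different code point sequences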
This is quite analogous to:
three = "abc" four = "ABC"
No.
Yes! :)
It actually illustrates something a bit different. Really:
three caseInsensitiveEq four
is equivalent to
caseFold(three) eq caseFold(four) # same as uc(three)... in an ASCII-only world
That is, different styles of string comparison end up being equivalent to literal string comparison applied after some normalization process.
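In other words, something like this (hypothetical names, a Perl 5 sketch):

use v5.16;
use Unicode::Normalize qw(NFD);

sub linguistically_eq   { NFD($_[0]) eq NFD($_[1]) }   # canonical-equivalent strings compare equal
sub case_insensitive_eq { fc($_[0])  eq fc($_[1])  }   # case differences discarded before comparing

say linguistically_eq("\N{LATIN SMALL LETTER E WITH ACUTE}", "e\N{COMBINING ACUTE ACCENT}") ? "eq" : "ne";   # eq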
So there are choices as to how to expose this at a language level (esp. an HLL level), but I'm thinking that different styles of string comparison will (internally) correspond to different types of normalization, not to different types of strings.
[[Side note 3: I think it will be really confusing if Perl6 has a single "eq" operator, whose behavior depends on the "current level", rather than either different operators or operators which take an additional parameter. But that's a language-level design question, and parrot can handle either approach.]]
I can't imagine that. I've an ASCII string and want to convert it to UTF-8 and UTF-16 and write it into a file. How do I do that?
That's the mindset shift. You don't have an ASCII string. You have a string, which may have come from a file or a buffer representing a string using the ASCII encoding. It's the example from above, again:
inputBuffer  = read(inputHandle);
string       = string_make(inputBuffer, "ASCII");
outputBuffer = encode(string, "UTF-16");
write(outputHandle, outputBuffer);
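(The same pipeline in concrete Perl 5, as a sketch -- the file names are placeholders, and Encode's "UTF-16" variant writes a byte-order mark by default:)

use strict;
use warnings;
use Encode qw(decode encode);

open my $in, '<:raw', 'file' or die "open: $!";
my $bytes = do { local $/; <$in> };         # raw octets from disk
close $in;

my $string = decode('ASCII', $bytes);       # byte buffer -> abstract string
my $utf16  = encode('UTF-16', $string);     # abstract string -> new byte buffer

open my $out, '>:raw', 'file2' or die "open: $!";
print {$out} $utf16;
close $out;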
Ok. I should have asked: how do I do that in PASM, of course.
I'm envisioning something like this:
open P0, "file", "<" read P1, P0 # P1 is a byte-buffer PMC new S1, P1, "ASCII" encode P2, S1, "UTF-16" # P2 is a byte-buffer PMC open P3, "file2", ">" print P3, P2
For this, we need encode/decode ops, as well as a PMC to represent "raw" bytes.
The other variant might look like this:
open P0, "file", "<", "ASCII" # or whatever syntax we decide, maybe "<:ASCII"
read S1, P0 # S1 is a string since P0 knows what encoding to use
open P3, "file2", ">", "UTF-16"
print P3, S1 # P3 knows what encoding to use
I think we want both variants to be available, i.e. in the IO API we need to support two styles of IO handles--one which reads and writes bytes, and one which reads and writes strings (and must, therefore, have an encoding attached to the handle). This might be done via the IO layer approach (you could push a string-ifying layer onto the stack), or maybe this would be what IO "filters" are for--I've seen the term mentioned in some of the docs, but I've not yet asked what a filter is supposed to be.
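(For what it's worth, Perl 5's PerlIO layers already behave like the second variant -- the encoding rides on the handle, so reads hand back strings and writes encode on the way out. A sketch, file names again placeholders:)

use strict;
use warnings;

open my $in,  '<:encoding(ASCII)',  'file'  or die "open: $!";
open my $out, '>:encoding(UTF-16)', 'file2' or die "open: $!";
print {$out} $_ while <$in>;
close $_ for $in, $out;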
Oh, and another option would be:
open P0, "file", "<" read S1, P0, "ASCII" # the IO operation specifies the encoding open P3, "file2", ">" print P3, S1, "UTF-16" # ditto
I think that option 1 is clear/explicit, and option 2 is convenient (but less powerful), and option 3 is reasonable but somehow less appealing. I think options 1 and 2 would co-exist nicely.
Keep the questions coming!
Jeff