Re: Plans for string processing

Leopold Toetsch Tue, 13 Apr 2004 15:50:28 -0700

Aaron Sherman <[EMAIL PROTECTED]> wrote:
> For example, in Perl5/Ponie:


>         @names=<NAMES>;
>         print "Phone Book: ", sort(@names), "\n";

> In this example, I don't see why I would care that NAMES might be a
> pseudo-handle that iterates over several databases, and returns strings
> in the 7 different languages

I already showed an example where uc("i") isn't "I". Collating is still
more complex than a »simple« uc().
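
As a rough illustration of both points, here is a minimal Python sketch. The tailoring table, `upper_for` and `crude_sort_key` are hypothetical names invented for this example, and the sort key is only a crude stand-in for real language-aware collation:

```python
import unicodedata

# Hypothetical per-language case tailorings; only Turkish is sketched here.
UPPER_TAILORING = {
    "tr": {"i": "\u0130", "\u0131": "I"},  # i -> I WITH DOT ABOVE, dotless i -> I
}

def upper_for(text, lang):
    """Uppercase text, applying the (assumed) tailoring for lang first."""
    table = UPPER_TAILORING.get(lang, {})
    return "".join(table.get(ch, ch.upper()) for ch in text)

def crude_sort_key(word):
    """Strip combining marks so that A-umlaut sorts next to A (a crude approximation)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

names = ["Zorn", "\u00c4rger", "Apfel"]
by_codepoint = sorted(names)                      # A-umlaut (U+00C4) lands after "Z"
by_crude_key = sorted(names, key=crude_sort_key)  # A-umlaut lands next to "A"
```

The same "i" uppercases differently depending on the language passed in, and the same list sorts differently depending on the key, which is why uc() and sort() cannot be defined on codepoints alone.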

> More generally, an operation performed on a string (be it read
> (comparison) or write (upcase, etc)) should be done in the way that the
> *caller* expects,

Well, we don't know what the caller expects. The caller has to decide.
There are at least two ways: treat all strings as language independent
(regardless of their origin), or attach more information to each
string.

>> *) Provides language-sensitive character overrides ('ll' treated as a
>> single character, for example, in Spanish if that's still desired)
>> *) Provides language-sensitive grouping overrides.

> Ah, and here we come to my biggest point of confusion.

Another example:

 "my dog Fiffi" eq "my dog Fi\x{fb03}"

When my program is doing typographical computations, the above equation is
true. And useful. The characters "f", "f", "i" are going to be printed,
but the ligature "ffi" takes less space when printed as such.
To me as a reader of this dog newspaper, though, it is the same
character string.

When I count the "f"s in dog names, or when I search for "ffi" in the
text, I don't care which of these forms is used; it should be the
same.

It just depends who's using these features in which context.
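
A small Python sketch of the two views, assuming Unicode compatibility normalization (NFKC) is an acceptable stand-in for "what the reader sees"; the helper name `fold` is made up for this example:

```python
import unicodedata

plain = "my dog Fiffi"
ligature = "my dog Fi\ufb03"   # U+FB03 LATIN SMALL LIGATURE FFI

def fold(text):
    """Map compatibility characters such as ligatures to their plain letters."""
    return unicodedata.normalize("NFKC", text)

same_codepoints = plain == ligature               # codepoint view: different strings
same_for_reader = fold(plain) == fold(ligature)   # reader view: the same text

f_count_raw = ligature.count("f")           # the ligature hides both "f"s
f_count_folded = fold(ligature).count("f")  # both "f"s are visible after folding
found_ffi = "ffi" in fold(ligature)
```

Whether the comparison should fold or not is exactly the decision that has to come from the caller's context.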

> I guess this boils down to two choices:

> a) All strings will have the user's language by default

> or

> b) Strings will have different languages and behave according to their
> "sources" regardless of the native rules of the user.

and/or either the string's or the user's default comes in, depending on the
desired action.

>> IW: Mush together (either concatenate or substr replacement) two
>> strings of different languages but same charset

> According to whose rules?

At the user level: what do you want to achieve? At the codepoint level the
operation is fine. It doesn't make sense above that, though.
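
To make that concrete, here is a hedged Python sketch of strings that carry a language tag. `TaggedString` and `concat` are hypothetical, and dropping the tag on a mismatch is just one possible policy, not a rule anyone has decided on:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TaggedString:
    text: str
    lang: Optional[str]  # None means "no single language applies"

def concat(a, b):
    """Join the codepoints; keep the language tag only if both sides agree."""
    lang = a.lang if a.lang == b.lang else None
    return TaggedString(a.text + b.text, lang)

mixed = concat(TaggedString("calle ", "es"), TaggedString("street", "en"))
same = concat(TaggedString("calle ", "es"), TaggedString("mayor", "es"))
```

The codepoints of `mixed` are well defined, but there is no longer an obvious language whose rules should govern sorting or case mapping of the result.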

> This means that someone's rules must become dominant,

It doesn't make much sense to do

   bors S0, S1   # stringwise bitwise or

to anything that isn't single-byte encoded. It depends.
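
The problem with bit operations on multi-byte data can be sketched in Python on raw bytes, here with a byte-wise OR. `bytes_or` is a hypothetical helper, and zero-padding the shorter operand is an assumption made for this example, not a statement about what Parrot does:

```python
def bytes_or(a: bytes, b: bytes) -> bytes:
    """Byte-wise OR of two byte strings, zero-padding the shorter one (assumed)."""
    width = max(len(a), len(b))
    a = a.ljust(width, b"\x00")
    b = b.ljust(width, b"\x00")
    return bytes(x | y for x, y in zip(a, b))

# Single-byte data: the result is at least a well-formed string.
ascii_result = bytes_or(b"a", b"b")

# UTF-8 data: "a" OR-ed with e-acute (two bytes) no longer decodes.
mixed = bytes_or("a".encode("utf-8"), "\u00e9".encode("utf-8"))
try:
    mixed.decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    decodes = False
```

With single-byte encodings every byte pattern is still a character; with UTF-8 the result may not be a valid string at all.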

The rules - how and when they apply - still have to be laid out.

leo
