Re: String Theory

Rod Adams Sat, 19 Mar 2005 23:05:19 -0800

Larry Wall wrote:

You've more or less described the semantics available at the "use
bytes" level, which basically comes down to a pure OO approach where
the user has to be aware of all the types (to the extent that OO
doesn't hide that).  It's one approach to polymorphism, but I think
it shortchanges the natural polymorphism of Unicode, and the approach
of Perl to such natural polymorphisms as evident in autoconversion
between numbers and strings.  That being said, I don't think your
view is so far off my view.  More on that below.

[ rest of post snipped, not because it isn't relevant, but because it's long 
and my responses don't match any single part of it. -- RHA ]

What I see here is a need to define what it means to coerce a string from one level to another.

First let me lay down my understanding of the different levels. I am towards the novice end of the Unicode skill level, so it'll be pretty basic.

At the "byte" level, all you have is 8 bits, which may have some meaning as text if you treat them like ASCII.

You can take one or more bytes at a time, lump them together in a predefined way, and generate a Code Point, which is an index into the Unicode table of "characters".
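As a minimal sketch of that byte-to-code-point step, assuming UTF-8 is the "predefined way" of lumping bytes together (the post does not name an encoding, and Python stands in for Perl 6 here):

```python
# Three bytes that, read as UTF-8, form one code point (U+20AC EURO SIGN).
raw = bytes([0xE2, 0x82, 0xAC])

# Decoding lumps the bytes together according to the UTF-8 rules.
text = raw.decode("utf-8")

# Each element of a Python str is one code point; ord() gives its index
# into the Unicode table.
code_points = [ord(ch) for ch in text]

print(len(raw), len(text), [hex(cp) for cp in code_points])
```

The byte-level length is 3 while the code-point-level length is 1, which is the "one or more units of the level below" relationship in miniature.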

However, Unicode has a problem with what it assigns code points to, so you sometimes need one or more code points together to form a proper character, or grapheme.
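A small illustration of one grapheme built from two code points, using Python's standard `unicodedata` module. The choice of "e" plus a combining acute accent is only an example, not something from the post:

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points that a
# reader sees as a single character.
decomposed = "e\u0301"

# A nonzero canonical combining class marks a code point that attaches
# to the base character before it.
is_combining = unicodedata.combining("\u0301") != 0

# NFC normalization folds the pair into the precomposed U+00E9, but only
# because Unicode happens to define a precomposed form for this pair.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), is_combining, len(composed))
```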

But Unicode has another problem, where certain graphemes mean very different things depending on what language you happen to be in. (Mostly a CJK issue, from what I've read.) So we add a language dependent level, which is basically graphemes with an implied language.

Even if I got parts of that wrong (very possible), the main point is that in general, a higher level takes one _or_more_ units of the level below it to construct a unit at its level.

So now, there's the question of what it means to move something from one level to another.

We'll start with moving "up" to a higher level. I'll use the example of moving from Code Points (cpts) to Graphemes (grfs), but the talk should translate to other conversions.

There are two approaches I see to this:

1) Convert every cpt into an exactly equivalent grf. The "length" of the two strings is equal.
2) Scan through the string, grouping cpts into associated grfs where possible. The resulting string "length" is less than or equal to that of the input. In short, attempt to keep the same semantic meaning of the word.

I see both methods as being useful in certain contexts, but #2 is likely what people want more often, and is what I have in mind.

Going "down" the chain, you run the risk of losing information with method #1. However, using #2, you simply "expand" the relevant grfs into the associated group of cpts.
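A rough sketch of both directions under method #2. The helper names `to_graphemes` and `to_code_points` are made up for illustration, and the grouping rule (attach every combining mark to the preceding base) is a deliberate simplification; real grapheme segmentation follows the rules in Unicode Standard Annex #29.

```python
import unicodedata

def to_graphemes(cpts):
    """Method #2, going up: group code points into graphemes.

    Simplified rule (an assumption): a code point with a nonzero
    combining class joins the grapheme that precedes it.
    """
    grfs = []
    for ch in cpts:
        if grfs and unicodedata.combining(ch) != 0:
            grfs[-1] += ch
        else:
            grfs.append(ch)
    return grfs

def to_code_points(grfs):
    """Method #2, going down: expand each grapheme into its code points."""
    return "".join(grfs)

word = "e\u0301a"            # 3 code points: 'e', combining acute, 'a'
method_1 = list(word)        # one grf per cpt, same "length"
method_2 = to_graphemes(word)

print(len(word), len(method_1), len(method_2))
print(to_code_points(method_2) == word)
```

With method #2 the round trip up and back down returns the original code points, which is why the downward step loses nothing in this sketch.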

My general approach to converting a string from one level to another is to pick an encoding both levels understand, generate a bitstring from the old level, and then have the new level parse that bitstring into its own units. If the start and goal don't allow this, throw an error.
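One way to picture this scheme, as a sketch only: the level names "bytes" and "cpts", the functions `serialize`, `parse`, and `convert`, and the choice of UTF-8 as the shared encoding are all assumptions made for illustration.

```python
def serialize(units, level):
    """Generate a bitstring (UTF-8 bytes) from a list of units at `level`."""
    if level == "bytes":
        return bytes(units)
    if level == "cpts":
        return "".join(chr(cp) for cp in units).encode("utf-8")
    raise ValueError("unknown level: %r" % level)

def parse(bitstring, level):
    """Have `level` parse the bitstring into a list of its own units."""
    if level == "bytes":
        return list(bitstring)
    if level == "cpts":
        # decode() raises UnicodeDecodeError when the bytes are not valid
        # UTF-8, which is the "throw an error" case.
        return [ord(ch) for ch in bitstring.decode("utf-8")]
    raise ValueError("unknown level: %r" % level)

def convert(units, old_level, new_level):
    """Convert by going through the shared bitstring."""
    return parse(serialize(units, old_level), new_level)

up = convert([0xC3, 0xA9], "bytes", "cpts")    # two bytes -> one code point
down = convert([0xE9], "cpts", "bytes")        # one code point -> two bytes

try:
    convert([0xFF], "bytes", "cpts")           # 0xFF never occurs in UTF-8
    failed = False
except UnicodeDecodeError:
    failed = True

print(up, down, failed)
```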

I'm not certain how your views relate to all this, but I was left with the impression that you were talking about conversions of type #1, in which case it would make sense to outlaw downward conversions, since it's possible the grf won't "fit" into a cpt.

It would also make sense to have an "allowable levels" parameter in such a scheme, so you know not to store a grf that can't also be a cpt, or at least to track that after one does so, one can't go back to cpts.

Taking a step back, perhaps I didn't make it clear (or even mention) that my coercions were DWIMish in nature, not pure bit-level unions. I covered String to String coercions above. For String -> Array, what happens depends on the type of the array. For String -> Array of Characters (back to my role), each element of the array corresponds to a single unit of whatever the string considered a character. However, String -> Array of u?int\d+ would do bit-level operations, and the encoding scheme would matter greatly in this case.
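To make the contrast concrete, here is a sketch in Python, where a string's "characters" are code points. The UTF-8 and UTF-16 (little-endian) encodings are picked purely as examples of how much the encoding scheme matters for the integer arrays:

```python
import struct

s = "h\u00e9"  # 'h' followed by U+00E9

# String -> Array of Characters: one element per unit the string itself
# considers a character (code points, in Python's case).
chars = list(s)

# String -> Array of uint8: a bit-level view through UTF-8.
u8 = list(s.encode("utf-8"))

# String -> Array of uint16: the same string through UTF-16 little-endian.
raw16 = s.encode("utf-16-le")
u16 = list(struct.unpack("<%dH" % (len(raw16) // 2), raw16))

print(chars, u8, u16)
```

The character array has two elements either way, while the integer arrays differ in both length and contents depending on the encoding.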

We/I will have to come up with a table of what these DWIMish operations are, and how a user could define a new one. That likely will be an extension of how you decide "tie" should happen in Perl 6.

I also see nothing wrong with most operations between strings of two levels autocoercing one string to the higher level of the other. Things like C<cmp>, C<~>, and many others should be fine in this regard, as long as they default to coercing "up". I singled C<index> out because it deals with two strings *and* with positions within those strings, and what a given integer position means can vary greatly with level. But even there I suppose that we could force the target's level onto the term, and make all positions relative to the target and its level.
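A quick illustration of how the same search yields a different position at each level. The sample strings are invented, UTF-8 is assumed for the byte level, and the grapheme position uses the simplification that every code point with a zero combining class starts a new grapheme:

```python
import unicodedata

haystack = "nai\u0308ve cafe"   # the 'i' carries U+0308 COMBINING DIAERESIS
needle = "cafe"

# Code point level: Python's str.index counts code points.
cpt_pos = haystack.index(needle)

# Byte level: search the UTF-8 bytes instead.
byte_pos = haystack.encode("utf-8").index(needle.encode("utf-8"))

# Grapheme level (simplified): count the code points before the match
# that start a new grapheme, i.e. those that are not combining marks.
grf_pos = sum(1 for ch in haystack[:cpt_pos]
              if unicodedata.combining(ch) == 0)

print(byte_pos, cpt_pos, grf_pos)
```

Three different integers name the same spot in the string, so an index result only makes sense once it is pinned to one level.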


As for the exact syntax of the coercion, I'm open to suggestions.

-- Rod Adams
