On Apr 28, 2004, at 10:59 AM, Larry Wall wrote:

All in all, very well written.

Thanks.


I do, of course, have a few quibbles:

On Wed, Apr 28, 2004 at 04:22:07AM -0700, Jeff Clites wrote:
: As it turns out, people find it convenient to programmatically represent a
: character by an integer (think "whole number", not a specific data type
: here).


After being so careful to define "character" abstractly, this whole
passage misleads the reader into believing that any such abstract
character can be represented by a single integer (code point).
Only a subset of characters can be represented by a single code point.
Many characters require multiple code points.

No, this is 100% intentional, and not meant as a pedagogical fib. I'm saying that an entry in the big table in the Unicode Standard describes a character. This is in fact consistent with the definition of "abstract character" put forth in the Unicode Standard.


They make the careful point that a "character" isn't necessarily what a naive language user would see as a character (and they call the latter a "grapheme"). That said, I think that this concept of a character really is in fact trying to capture an intuitive, natural-language concept. But the Unicode character repertoire is a product of compromises, and of the desire to maintain backward compatibility with previous standards. For instance, there's a separate entry for Angstrom Sign vs. Latin Capital Letter A with Ring Above. Ideally, these wouldn't be distinguished. They are distinguished because they are distinguished in Shift-JIS, and it was desirable to be able to round-trip between important national standards and Unicode-defined encodings. And even conceptually that's not that bad--presumably, Shift-JIS distinguished between the two because someone thought of them as semantically different. But examples such as this don't imply that a code point is trying to pick out a different concept than a character, just that in some cases things may not mesh with a particular person's intuition.

Also, I don't have a problem with the formalization of a concept breaking with its informal usage in spots. For instance, floating point numbers and integers are supposed to model the mathematical notion of a number, but the former are bounded in range and precision, whereas the latter concept is not. But that's just a practical shortcoming--not a whole separate concept.

I see this as a critical
point--it's at the one-to-many interfaces that things tend to break,
and that's precisely why Perl 6 has the four abstraction levels it does:


    Level 0: bytes
    Level 1: codepoints don't fit into bytes
    Level 2: graphemes don't fit into codepoints
    Level 3: characters don't fit into graphemes

(where I've used the term "characters" in the language-sensitive sense.)

I see there as ideally being just two levels:


First off, I'd say bytes don't have anything to do with a (text) string, as an in-memory data type. You can serialize a string into bytes, but you can serialize a hash into bytes as well. It's not productive to have a byte-based view of either. You can't ask, "what's the first byte of this string", any more than you can ask "what's the first byte of this hash". But you can ask, "if this string were serialized using the UTF-8 encoding, what would the first byte of the result be", just as you could ask, "if this hash were serialized using Data::Dumper, what would the first byte of the result be".

So I worry about your level 0, because it promotes the idea that a string is semantically byte-based. Even more importantly, if a byte-based operation is intended to serialize a string on the fly using some default encoding, manipulate the result, and then re-create a string from those bytes, then in the UTF-8 case you're quite often going to end up with a byte sequence which can't be de-serialized into a string. So you'll just end up shredding your string if you try to do byte-based regex replacements.
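To make that concrete, here's a rough Perl 5 sketch of the failure mode I mean (Perl 5 because that's what actually runs today; the string and the particular byte I clobber are just illustrative):

    use Encode qw(encode decode);

    my $str   = "r\x{E9}sum\x{E9}";       # a character string: "resume" with two acute e's
    my $bytes = encode("UTF-8", $str);    # serialize: each \x{E9} becomes the two bytes 0xC3 0xA9

    # A byte-oriented edit that knows nothing about UTF-8 can land in the
    # middle of a multi-byte sequence; here we clobber the continuation
    # byte of the first e-acute.
    substr($bytes, 2, 1) = "?";

    # Re-creating a string from those bytes now fails outright.
    my $back = eval { decode("UTF-8", $bytes, Encode::FB_CROAK) };
    print defined $back ? "round-tripped\n" : "shredded: $@";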

(That said, in the approach I'm pushing for of having a separate data type to hold "raw bytes", perhaps ByteArray, it makes perfect sense to allow some regexes to be applied to that--essentially searching for certain byte sequences.)

Level 1 is what I'm calling characters--no problem there semantically.
Level 2 is graphemes--sequences of characters. Okay-ish.

Level 3 I don't think should be a different level, though I'm not certain I 100% understand what you have in mind. To my way of thinking, a grapheme is basically that which a language user would think of as a "character". The Unicode Standard defines a language-agnostic concept of a grapheme (sort of a general consensus across languages), plus the concept of language-specific refinements. So I'd say that picking a language lets you refine what sequences count as graphemes, but doesn't pick out an entirely separate concept.

So it feels to me like there should be per-character and per-grapheme operations--like we just need two levels. But I need to give this area some more thought--there's something a bit slippery about counting graphemes.
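For instance, in terms of what Perl 5 already gives us (taking \X as only a rough stand-in for the default grapheme cluster rules), the two levels look like this:

    my $str = "e\x{301}";    # LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT

    my $chars     = length $str;       # 2 -- the per-character (code point) view
    my @graphemes = $str =~ /\X/g;     # 1 match -- \X grabs the base plus its combining mark
    printf "%d characters, %d graphemes\n", $chars, scalar @graphemes;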

Not making this distinction also causes you to leave out a level
of collation:

    Level 0: binary sorting

This binary sorting is how you have to sort ByteArrays, but not how you'd naturally sort strings.


    Level 1: codepoint sorting

And I think you could have variations here--sorting by numerical code point order, and sorting as though your strings were in various normalization forms (C, D, KC, KD).
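One such variation is just a different comparison block (a Perl 5 sketch, with Unicode::Normalize standing in for whatever the real API ends up being; the strings are only for illustration):

    use Unicode::Normalize qw(NFD);

    my @strings = ("\x{E9}clair", "e\x{301}clair", "eclair");

    # Plain codepoint order:
    my @raw = sort @strings;

    # The same sort, but comparing as though each string were in NFD:
    my @normalized = sort { NFD($a) cmp NFD($b) } @strings;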


    Level 2: language-independent grapheme sorting (UCA)
    Level 3: UCA plus tailorings

UCA and UCA plus tailorings are two choices, but I don't think they are really two "levels" of sorting.


I think the sorting choices are very much like sorting using different comparison operators--not inherently "levels", but really chosen on a per-sort basis. All in all, sorting seems more straightforward than regex matching, since we already had the concept of there being different sort flavors.
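In Perl 5 terms (Unicode::Collate ships with 5.8, so this runs today; the word list is just for illustration), the choice really does look like picking a comparator per sort:

    use Unicode::Collate;

    my @words = ("apple", "banana", "zebra", "\x{C9}clair");   # that last one is "Eclair" with acute E

    # Codepoint-order flavor: "Eclair" lands after "zebra", since U+00C9 > 'z'.
    my @by_codepoint = sort @words;

    # UCA flavor: a collator object chosen for this particular sort call,
    # which puts the accented word back with the other e-words.
    my $uca    = Unicode::Collate->new();
    my @by_uca = $uca->sort(@words);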

I think we should beware of being overly contextual--that is, of thinking of these things as "levels" or "modes", to be specified via "use" directives, rather than clearly specified on a per-operation basis. That is, ideally I could look at a line of code and know what it would do, without having to look higher in the code for "use" directives. (But specifying the "mode" as part of a regex fragment would work nicely, if that's the basic idea.)

: It's convenient for several reasons--it's compact and easy to
: refer to in speech. And if the fundamental thing you can ask a string
: is what its Nth character is, then the fundamental things you do with a
: character is look up its properties, and test it for equality against
: other characters. So if you just go through and give each character a
: little serial number, then you can find the properties of a character
: by using its number as an index into a property table (i.e., character
: 3's properties are at slot number 3 in the table), and you can tell
: that 2 characters are different characters by checking whether they are
: represented by different numbers.


But this is really only true of codepoints, not of graphemes or
characters.

The key here is that your usage of these terms diverges from mine, and from the Unicode Standard's. To me (and to the Unicode Consortium, by my reading of the standard), a "code point" is a number representing an (abstract) character. A grapheme is a sequence of one or more abstract characters, intended to correspond to some natural-language concept of a single unit of text.


I realize that oversimplifying is a useful pedagogical
technique, but when you do that you ought to "unlie" in the same
document somewhere.  (I'll grant you that you promise to unlie in
your final paragraph, kinda sorta.)

As I said above, I do actually mean what I was saying literally, though the parts I haven't gotten to would have driven that point home.


: Fortunately, the Unicode Standard has numbered *all* of them--it's
: given a number to essentially every character in every
: digitally-represented language in the world.

Um, no--not unless you've defined how to multiplex the multiple
integers of the codepoints in a grapheme into a single integer, and
I haven't heard that the Unicode consortium has come up with such
a definition.

No, exactly--they've numbered characters, not graphemes. This point in fact brings up a worry I have about your grapheme-level of semantics. I think it makes perfect sense to say that two strings are graphemically equivalent (despite being composed of different characters), but it gets dicey to say they're made of the "same graphemes". Saying that implies that you have a data type to uniquely represent "a grapheme", and there isn't a convenient data type for that (unless someone goes and numbers all possible ones).


: So, let's review again. For various practical reasons, it's preferable
: to programmatically represent characters using integers, you have to
: pick an arbitrary numbering scheme, and somebody's done that, and it's
: a good one. This numbering scheme defines a one-to-one correspondence
: between numbers (code points) and characters,


There you go again.  You need to settle on one definition of character
or the other.

Yep, I did, but I think your brain is rejecting it because it's not the definition you expected.


I'm very much avoiding a fuzzy definition of character as something like, "what a user would generally see as a single thing".

I kind of like the abstract definition, but that's not how you're using it here.

The definition is that someone went through the abstract choices and made some specific judgment calls, deciding which abstract notions to distinguish and which not to. Then, they recorded their decisions in a big table.


To use my example from before, I don't think one can give a knock-down-drag-out argument that "a" and "A" are not just stylistic variants of the same character, and are in fact two different characters. You could certainly think of them that way. But, once you've made a decision one way or the other about how to look at them, you've laid down part of a precise definition of a general, fuzzy concept.
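And that big table is quite literally consultable from code; in Perl 5, for example (Unicode::UCD is core; the two lookups are just illustrative):

    use Unicode::UCD qw(charinfo);

    # "a" and "A" got two separate rows in the table -- that was one of the
    # judgment calls: two characters, not stylistic variants of one.
    my $lower = charinfo(ord "a");
    my $upper = charinfo(ord "A");
    print "$lower->{code}: $lower->{name}\n";   # 0061: LATIN SMALL LETTER A
    print "$upper->{code}: $upper->{name}\n";   # 0041: LATIN CAPITAL LETTER A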

: and that makes it
: tempting to pretend that characters *are* numbers. But it's important
: to keep in the back of your mind an awareness that the numbers merely
: help you pick out the characters, and it's the characters themselves
: which are important, and characters are *abstract*--they never actually
: live inside of a computer program.


Cain't have it both ways...

It's an isomorphism. Someone picked the list of characters, then numbered them. It doesn't matter that "A" was given number 65, but since it was, it's unambiguous to say, "the character to which Unicode gives the number 65", or even just the shorter "character 65". But I think in general it's important to remember, even when speaking like that, that a code point is literally a number, which represents something non-numerical (a character).
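Or, in code (Perl 5's chr/ord are exactly this mapping and its inverse):

    my $ch = chr(65);      # "the character to which Unicode gives the number 65", i.e. "A"
    my $cp = ord("A");     # 65 -- go back from the character to its code point
    printf "U+%04X <-> %s\n", $cp, $ch;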


: [Note: Of course, some numbers don't
: represent any character--there are only so many characters. So to be
: mathematically precise, there's a one-to-one correspondence between a
: subset of integers and all characters.]

And many characters are not represented by any integer, but by a sequence
of integers.

Not in my definition, or in Unicode's. I'd state this as, "Many graphemes are not represented by any integer, but by a sequence of integers".


For instance, see <http://www.unicode.org/reports/tr29/>, which states:

        One or more Unicode characters may make up what the user thinks of
        as a character or basic unit of the language. To avoid ambiguity
        with the computer use of the term character, this is called a
        grapheme cluster. For example, “G” + acute-accent is a grapheme
        cluster: it is thought of as a single character by users, yet is
        actually represented by two Unicode code points.

: Also, importantly, a grapheme cluster is a notion built on top of
: characters (it's a cluster of characters), and choosing a language lets
: you refine how you break up a string into grapheme clusters, but it's
: just a refinement--"adding a language into the mix" doesn't pick out a
: different semantic construct, it just helps you customize your choice of
: what ranges make up single graphemes.


I'd say a grapheme cluster functions as a "character" by your original
definition, so this is another case where you're using "character"
to mean something less than that. Also the last sentence seems to
be calling a grapheme cluster a grapheme, which is confusing. A grapheme
cluster is a cluster of graphemes, kinda by definition...

Nope, they're synonyms--a "grapheme cluster" is a preferred term for this usage of "grapheme", to distinguish it from the linguistic usage. For instance, the above-referenced document states:


        In previous documentation, default grapheme clusters were
        referred to as “locale-independent graphemes”. The term cluster has
        been added to emphasize that the term grapheme is used differently
        in linguistics.

I assume what they had in mind was something akin to, "a grapheme-forming cluster (of characters)". But it is a bit muddled--their usage is not entirely consistent.

(And shame on the Unicode Consortium for picking this term for this concept, and then disambiguating in a confusing way.)

My goal in all of this is to provide concrete rather than fuzzy definitions wherever possible. The one-to-one mapping between code points and characters makes a lot of sense to me--not only do I believe that is what the Unicode Consortium intended (despite compromises), but if one tries to say "no no, code points don't correspond to characters, groups of them do", then you're left wondering just what sort of thing the Unicode Consortium went through and numbered.

And I will have some more to say about why it doesn't really bother me that <o-with-acute-accent> and <o, combining-acute-accent> are two _different_ character sequences which are graphemically equivalent (or, which are unequal but equivalent under various notions of equivalence).
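As a preview, in Perl 5 terms (Unicode::Normalize again; the particular characters are just an example), that distinction looks like this:

    use Unicode::Normalize qw(NFC);

    my $precomposed = "\x{F3}";       # U+00F3 LATIN SMALL LETTER O WITH ACUTE
    my $decomposed  = "o\x{301}";     # o followed by U+0301 COMBINING ACUTE ACCENT

    # Different character sequences...
    print $precomposed eq $decomposed ? "equal\n" : "unequal\n";                       # unequal

    # ...but canonically equivalent, which normalization makes testable:
    print NFC($precomposed) eq NFC($decomposed) ? "equivalent\n" : "inequivalent\n";   # equivalent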

: I haven't yet covered a few important topics, such as different
: character sequences representing equivalent graphemes, canonical and

s/character/codepoint/

: compatibility equivalence, and Unicode normalization forms. I also
: haven't said anything yet about concrete implementation or API
: guidelines.

I await your coverage of those topics with interest.

Thanks. I need to write that up soon; I suppose I'll post it to p6l, as that seems more appropriate at this point.


Jeff


