> > "abc".chars would return <a b c>, which I'm guessing would usually
> > be byte-sized.
> 
> Fair enough.
> 
> > "日本語".chars would return <日 本 語>, which can probably be
> > expressed with UTF8?
> 
> I think you're confusing UTF8 (which can represent ALL Unicode
> characters) and "the UTF8 subset which consists of one-byte
> representations" (which happens to overlap with 7-bit ASCII).

Perhaps my confusion is that I thought, perhaps wrongly, that since
.chars returns a count that is appropriate for the given Unicode
level, if it were able to return a list in list context, each element
would have the storage size needed for the given string contents. For
instance, <a b c> requires just one byte per element, while Kanji
would require more. I'm leaving very wide room here for me really
misunderstanding how all this works.
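
To make the distinction concrete, here is a minimal sketch in Python (not Perl 6, and not a claim about how Perl 6 stores strings). It assumes UTF-8 as the encoding, and the helper name utf8_sizes is made up for illustration. It shows that the character count and the per-character storage size are separate questions:

```python
def utf8_sizes(text):
    """Return (character, UTF-8 byte count) pairs, one per code point.

    Hypothetical helper for illustration only.
    """
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# Both strings contain three characters ...
print(len("abc"), len("日本語"))

# ... but the encoded size per character differs.
print(utf8_sizes("abc"))      # one byte each
print(utf8_sizes("日本語"))   # three bytes each in UTF-8
```

So the count stays at three in both cases; only the encoded length changes (3 bytes versus 9 bytes in UTF-8).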

> 
> > From Apocalypse 5: "Under level 2 Unicode support, a character
> > is assumed to mean a grapheme, that is, a sequence consisting of a
> > base character followed by 0 or more combining characters."
> > Marcus
> 
> Hmmm... that doesn't answer the ligature question clearly though. That
> answers for the case of combining diacritical marks:

I read "followed by 0 or more combining characters" to mean that it is
smart enough to combine the vowel marks in Arabic and in other scripts
that use special conjuncts. However, I'm also not exactly sure if
that's even reasonably possible, or even if it makes sense in the
counting of "characters" for languages that use those.
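
As a rough sketch of the quoted definition (a base character followed by 0 or more combining characters), the following Python groups each code point that has a non-zero canonical combining class with the character before it. This is an assumption-laden approximation, not the Perl 6 implementation and not the full Unicode grapheme cluster rules; the function name graphemes is hypothetical.

```python
import unicodedata


def graphemes(text):
    """Group each base character with the combining marks that follow it.

    Simplified: a code point counts as "combining" when its canonical
    combining class is non-zero. Real grapheme segmentation has more rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


latin = "e\u0301"             # e + COMBINING ACUTE ACCENT
arabic = "\u0628\u064e\u0628"  # BEH + FATHA (vowel mark) + BEH
ligature = "\ufb01"           # LATIN SMALL LIGATURE FI, a single code point

for s in (latin, arabic, ligature):
    print(len(s), "code points,", len(graphemes(s)), "graphemes")
```

Under this reading, an Arabic vowel mark is folded into the preceding consonant, while a precomposed ligature is already a single code point and counts as one grapheme, so the definition by itself does not split it into letters.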
