> > "abc".chars would return <a b c>, which I'm guessing would be
> > bytesize usually.
>
> Fair enough.
>
> > "日本語".chars would return <日 本 語>, which can probably be
> > expressed with UTF8?
>
> I think you're confusing UTF8 (which can represent ALL Unicode
> characters) and "the UTF8 subset which consists of one-byte
> representations" (which happens to overlap with 7-bit ASCII).

Perhaps my confusion is this: I thought, perhaps wrongly, that since
.chars returns a count appropriate to the given Unicode level, then if
it could also return a list in list context, each element would use
whatever storage size the string contents need. For instance, <a b c>
requires just one byte per element, while Kanji would require more. I'm
leaving very wide room open here for me really misunderstanding how all
this works.

> > From Apocalypse 5: "Under level 2 Unicode support, a character
> > is assumed to mean a grapheme, that is, a sequence consisting of a
> > base character followed by 0 or more combining characters."
> > Marcus
>
> Hmmm... that doesn't answer the ligature question clearly though. That
> answers for the case of combining diacritical marks:

I read "followed by 0 or more combining characters" to mean that it is
smart enough to combine the vowels in Arabic and other syllabic
alphabets that use special conjuncts. However, I'm also not exactly sure
whether that's even reasonably possible, or whether it even makes sense
when counting "characters" in languages that use those.
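
To make the three ways of counting concrete for myself, here is a rough
sketch in Python (not Perl 6, and not how .chars is actually specified).
The helper name count_levels is made up for illustration, and the
"grapheme" count is only an approximation of the Apocalypse 5 wording: it
counts each character that is not a combining mark as the start of a new
grapheme. Real grapheme segmentation has more rules than this.

```python
import unicodedata


def count_levels(s):
    """Return (utf8_bytes, codepoints, approx_graphemes) for a string.

    count_levels is a hypothetical helper for this thread, not a Perl 6 API.
    """
    # Storage size when the string is encoded as UTF-8.
    utf8_bytes = len(s.encode("utf-8"))
    # Python strings are sequences of code points.
    codepoints = len(s)
    # Approximation: a base character plus any following combining
    # characters counts once. A string that starts with a combining
    # mark is undercounted by this simple rule.
    graphemes = sum(1 for ch in s if unicodedata.combining(ch) == 0)
    return utf8_bytes, codepoints, graphemes


print(count_levels("abc"))            # (3, 3, 3): one byte per character
print(count_levels("日本語"))          # (9, 3, 3): three bytes per character
print(count_levels("e\u0301"))        # (3, 2, 1): e + combining acute accent
print(count_levels("\u0644\u0627"))   # (4, 2, 2): Arabic lam + alef
```

If this sketch is right, the byte count is a property of the encoding
rather than of the character count, and the last case suggests that the
"base character plus combining characters" rule by itself says nothing
about ligatures: lam and alef are usually drawn as one joined shape, but
neither is a combining character, so they still count as two.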