On Mon, 2005-04-11 at 15:40, gcomnz wrote:
> I have to say I'm slightly confused too for some languages,
> especially for syllabic alphabets. At the same time, I'm pretty clear
> for CJK, Syllabaries, and alphabets, or at least I hope I'm clear (I
> guess I'm about to find out), .chars just returns the right unicode
> level for whatever the string contents requires.
> "abc".chars would return <a b c>, which I'm guessing would be
> bytesize usually.

Fair enough.

> "ææè".chars would return <æãæãè>, which can probably be
> expressed with UTF8?

I think you're confusing UTF8 (which can represent ALL Unicode
characters) and "the UTF8 subset which consists of one-byte
representations" (which happens to overlap with 7-bit ASCII).

> From Apocalypse 5: "Under level 2 Unicode support, a character
> is assumed to mean a grapheme, that is, a sequence consisting of a
> base character followed by 0 or more combining characters."
> Marcus

Hmmm... that doesn't answer the ligature question clearly though.

That answers for the case of combining diacritical marks:

http://en.wikipedia.org/wiki/Combining_diacritical_mark

e.g. <A ̀> vs "À", which is a pre-combined example, but there are (as
I understand it) many valid examples which do not have a pre-combined
representation in Unicode.

But not for ligatures:

http://en.wikipedia.org/wiki/Ligature_%28typography%29

which are, by definition, actually two or more unique characters which
have a special typographical representation when adjacent. So, they
are a single grapheme, but like I said: certain cultures would be
shocked by a .chars that did not decompose their ligatures (and again,
I'm mostly thinking Arabic, so I'd defer to someone who actually spoke
Arabic and knows how they deal with this).
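
For what it's worth, the distinctions above can be poked at with a
rough sketch in Python (not Perl 6, whose .chars semantics are the
thing under discussion). It only uses the standard unicodedata module.
The sample characters are my own picks for illustration, not the ones
from the quoted message, and the variable names are made up.

```python
import unicodedata


def utf8_len(text):
    """Number of bytes needed to encode text as UTF-8."""
    return len(text.encode("utf-8"))


# Assumed sample characters, chosen only for illustration.
ascii_char = "a"       # U+0061, inside 7-bit ASCII
cjk_char = "\u65E5"    # U+65E5, a CJK ideograph

# UTF-8 can encode both; only the ASCII one fits in a single byte.
ascii_bytes = utf8_len(ascii_char)
cjk_bytes = utf8_len(cjk_char)

# Combining mark: "A" + U+0300 COMBINING GRAVE ACCENT versus the
# pre-combined U+00C0. Canonical normalization maps between the two.
decomposed = "A\u0300"
precombined = "\u00C0"
composed_form = unicodedata.normalize("NFC", decomposed)
decomposed_form = unicodedata.normalize("NFD", precombined)

# "q" + U+0300 has no pre-combined code point, so NFC leaves it as
# two code points even though it is one base + one combining mark.
no_precombined = unicodedata.normalize("NFC", "q\u0300")

# Ligatures that have their own code points.
latin_fi = "\uFB01"          # LATIN SMALL LIGATURE FI
arabic_lam_alef = "\uFEFB"   # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

# Canonical decomposition (NFD) leaves the ligature alone;
# compatibility decomposition (NFKD) splits it into its letters.
fi_canonical = unicodedata.normalize("NFD", latin_fi)
fi_compat = unicodedata.normalize("NFKD", latin_fi)
lam_alef_compat = unicodedata.normalize("NFKD", arabic_lam_alef)

if __name__ == "__main__":
    print(ascii_bytes, cjk_bytes)
    print(len(decomposed), len(composed_form), len(no_precombined))
    print(len(fi_canonical), fi_compat, len(lam_alef_compat))
```

Note that this only covers ligatures that exist as code points of
their own. A ligature that is produced purely by the font when two
letters sit next to each other is already stored as separate
characters, so there is nothing in the string to decompose.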