On Mon, 2005-04-11 at 15:40, gcomnz wrote:
> I have to say I'm slightly confused too for some languages,
> especially for syllabic alphabets. At the same time, I'm pretty clear
> for CJK, syllabaries, and alphabets, or at least I hope I'm clear (I
> guess I'm about to find out): .chars just returns the right Unicode
> level for whatever the string contents require.

> "abc".chars would return <a b c>, which I'm guessing would be
> byte-sized usually.

Fair enough.

> "日本語".chars would return <日 本 語>, which can probably be
> expressed with UTF8?

I think you're confusing UTF8 (which can represent ALL Unicode
characters) and "the UTF8 subset which consists of one-byte
representations" (which is exactly 7-bit ASCII).
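
To make the distinction concrete, here is a rough Python sketch (Python
is my choice for illustration only, and the sample characters are
arbitrary picks, not anything from the thread). It just counts how many
bytes UTF-8 needs per character:

```python
# How many bytes does UTF-8 need for each of a few sample characters?
# The samples are arbitrary illustrations, chosen to hit each width.
samples = {
    "a": "ASCII letter",
    "\u00e9": "Latin small e with acute",
    "\u65e5": "CJK ideograph",
    "\U0001F600": "emoji outside the BMP",
}

# Map each character to the length of its UTF-8 encoding.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

# For pure ASCII text, the UTF-8 bytes and the ASCII bytes are identical.
ascii_same = "abc".encode("utf-8") == "abc".encode("ascii")

for ch, w in widths.items():
    print(f"U+{ord(ch):04X} {samples[ch]}: {w} byte(s)")
```

So every one of these is "expressible in UTF8"; only the first fits in
a single byte.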

> From Apocalypse 5: "Under level 2 Unicode support, a character
> is assumed to mean a grapheme, that is, a sequence consisting of a
> base character followed by 0 or more combining characters."
> Marcus

Hmmm... that doesn't answer the ligature question clearly though. That
answers for the case of combining diacritical marks: 

        http://en.wikipedia.org/wiki/Combining_diacritical_mark

e.g. <A ̀> vs "À", which is a pre-combined example, but there are (as I
understand it) many valid examples which do not have a pre-combined
representation in Unicode.
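
A rough Python sketch of that "base character plus combining
characters" rule, for illustration only. The graphemes() helper is
hypothetical (it is not Perl 6's .chars), and it is a simplification
that only looks at the canonical combining class, so treat it as an
approximation of the level 2 behaviour rather than a full
implementation:

```python
import unicodedata

def graphemes(s):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper: approximates the 'base character followed by
    0 or more combining characters' rule via the combining class only.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new grapheme
    return clusters

decomposed = "A\u0300"              # A + COMBINING GRAVE ACCENT
precomposed = "\u00c0"              # LATIN CAPITAL LETTER A WITH GRAVE

# NFC folds the pair into the pre-combined character when one exists.
folded = unicodedata.normalize("NFC", decomposed)

# q + grave has no pre-combined form, so NFC leaves two code points.
no_precomposed = unicodedata.normalize("NFC", "q\u0300")

clusters = graphemes("A\u0300bc")
```

Either way the grapheme count is the same: the decomposed "A" plus
grave counts as one, and so does the q-with-grave that has no single
code point of its own.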

But not for ligatures:

        http://en.wikipedia.org/wiki/Ligature_%28typography%29

which are, by definition, actually two or more unique characters which
have a special typographical representation when adjacent. So, they are
a single grapheme, but like I said: certain cultures would be shocked by
a .chars that did not decompose their ligatures (and again, I'm mostly
thinking Arabic, so I'd defer to someone who actually speaks Arabic and
knows how they deal with this).
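
For what it's worth, here is a hedged Python sketch of what
"decomposing ligatures" could look like where the ligature is stored
as a single presentation-form code point. The decompose_ligatures()
name is hypothetical, and using compatibility normalization (NFKD) is
my assumption about one way to do it; NFKD also rewrites many other
compatibility characters, so it is a blunt tool, and text where the
ligature is produced purely by the font is not affected at all:

```python
import unicodedata

def decompose_ligatures(s):
    """Hypothetical helper: split presentation-form ligatures via NFKD.

    Assumption: compatibility decomposition is acceptable here, even
    though it changes more than just ligatures.
    """
    return unicodedata.normalize("NFKD", s)

latin = "\ufb01"    # LATIN SMALL LIGATURE FI
arabic = "\ufefb"   # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

latin_parts = decompose_ligatures(latin)     # two letters: f, i
arabic_parts = decompose_ligatures(arabic)   # two letters: lam, alef

# Neither ligature contains a combining mark, so a rule that only
# groups base + combining characters sees each one as one character.
has_marks = any(unicodedata.combining(ch) for ch in latin + arabic)
```

So a .chars built only on the "base plus combining characters" rule
would count each ligature code point as one character, and getting two
out of it needs an extra decomposition step along these lines.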
