Re: Question about list context for String.chars

Aaron Sherman Mon, 11 Apr 2005 12:55:43 -0700

On Mon, 2005-04-11 at 15:40, gcomnz wrote:
> I have to say I'm slightly confused too for some languages,
> especially for syllabic alphabets. At the same time, I'm pretty clear
> for CJK, Syllabaries, and alphabets, or at least I hope I'm clear (I
> guess I'm about to find out), .chars just returns the right unicode
> level for whatever the string contents requires.


> "abc".chars  would return <a b c>, which I'm guessing would be
> bytesize usually.

Fair enough.

> "ææè".chars would return <æãæãè>, which can probably be expressed
> with UTF8?

I think you're confusing UTF8 (which can represent ALL Unicode
characters) and "the UTF8 subset which consists of one-byte
representations" (which happens to overlap with 7-bit ASCII).
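To make the distinction concrete, here is a minimal sketch in Python (not Perl 6; the sample characters and the helper name utf8_len are my own choices for illustration). It shows that UTF-8 uses one to four bytes per code point, and that only the one-byte range coincides with 7-bit ASCII:

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 needs to encode a single code point."""
    return len(ch.encode("utf-8"))

# Illustrative code points: ASCII, Latin-1 range, CJK, and one beyond the BMP.
samples = ["a", "\u00e6", "\u65e5", "\U0001d11e"]
lengths = [utf8_len(ch) for ch in samples]

# Pure-ASCII text has identical bytes under ASCII and UTF-8.
ascii_bytes_match = "abc".encode("utf-8") == "abc".encode("ascii")
```

So whether a string "can be expressed in UTF8" is never in question; what varies is how many bytes each character takes.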

> From Apocalypse 5: "Under level 2 Unicode support, a character
> is assumed to mean a grapheme, that is, a sequence consisting of a
> base character followed by 0 or more combining characters."
> Marcus

Hmmm... that doesn't answer the ligature question clearly though. That
answers for the case of combining diacritical marks: 

        http://en.wikipedia.org/wiki/Combining_diacritical_mark

e.g. <A Ì> vs "Ã", which is a pre-combined example, but there are (as I
understand it) many valid examples which do not have a pre-combined
representation in Unicode.
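
A small Python sketch of the combining-mark case, using the standard unicodedata module. The specific pairs (A plus a combining grave accent, and q plus a combining tilde) are illustrative picks of mine, and rough_grapheme_count is a hypothetical helper that only approximates the "base character plus combining characters" definition quoted above; it is not full Unicode grapheme segmentation:

```python
import unicodedata

decomposed = "A\u0300"   # LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT
precomposed = "\u00c0"   # LATIN CAPITAL LETTER A WITH GRAVE

# NFC composes the pair into the single pre-combined code point.
composed = unicodedata.normalize("NFC", decomposed)

# q + COMBINING TILDE has no pre-combined form, so NFC leaves two code points.
no_precomposed = unicodedata.normalize("NFC", "q\u0303")

def rough_grapheme_count(s: str) -> int:
    """Count code points that are not combining marks (a rough approximation)."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

counts = [rough_grapheme_count(x) for x in (decomposed, precomposed, no_precomposed)]
```

Under this approximation all three strings count as one character each, even though their code point lengths differ.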

But not for ligatures:

        http://en.wikipedia.org/wiki/Ligature_%28typography%29

which are, by definition, actually two or more unique characters which
have a special typographical representation when adjacent. So, they are
a single grapheme, but like I said: certain cultures would be shocked by
a .chars that did not decompose their ligatures (and again, I'm mostly
thinking Arabic, so I'd defer to someone who actually speaks Arabic and
knows how they deal with this).
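
For what it's worth, Unicode does encode a handful of ligatures as single compatibility characters, and those can be split by compatibility normalization. The sketch below (Python, standard unicodedata module; the two sample characters are my own choice) only covers that encoded subset. Most ligatures are produced by fonts at rendering time and never appear in the string at all, so this says nothing about how .chars ought to behave:

```python
import unicodedata

fi_ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
lam_alef = "\ufefb"     # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

# Canonical normalization (NFC) leaves the ligature as one code point.
fi_nfc = unicodedata.normalize("NFC", fi_ligature)

# Compatibility normalization (NFKC) splits it into its component letters.
fi_nfkc = unicodedata.normalize("NFKC", fi_ligature)
lam_alef_nfkc = unicodedata.normalize("NFKC", lam_alef)
```

Here fi_nfkc is the two letters "fi" and lam_alef_nfkc is ARABIC LETTER LAM followed by ARABIC LETTER ALEF, so a per-letter view is recoverable at least for these encoded forms.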
