On Jun 14, 2004, at 1:54 PM, Dan Sugalski wrote:
Parrot provides code points for all graphemes, even for those
character sets/encodings which don't inherently do so. Most sets that
have variable-length encodings use an escape sequence scheme--the
value of the first byte in a character determines whether the
grapheme is a single-byte or multi-byte sequence. When Parrot turns these into
code points it does it by building up the final value. The first byte
is put in the low 8 bits of the integer. If there's a second byte in
the sequence the current value is shifted left 8 bits and the new byte
is stuffed in the low 8 bits. If there's a third byte in the sequence
everything is shifted left again 8 bits and that third byte is stuffed
in the bottom, and so on.
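
For concreteness, the shift-and-accumulate scheme described in the quoted text could be sketched roughly as below. This is only an illustration in Python, not Parrot's actual code: the function name bytes_to_code_point is made up, and it assumes the caller has already used the lead byte to work out where the byte sequence for one character ends.

```python
def bytes_to_code_point(seq: bytes) -> int:
    """Fold the bytes of one character into a single integer.

    Hypothetical helper, not part of Parrot. Assumes `seq` holds exactly
    the bytes of one character, already delimited by the caller.
    """
    if not seq:
        raise ValueError("empty byte sequence")
    value = 0
    for byte in seq:
        # Shift what we have so far left 8 bits, then put the new byte
        # in the low 8 bits.
        value = (value << 8) | byte
    return value


# A one-byte character keeps its byte value; a two-byte sequence
# 0x82 0xA0 becomes the integer 0x82A0.
single = bytes_to_code_point(b"\x41")
double = bytes_to_code_point(b"\x82\xa0")
```

Note that the resulting integer is just the packed bytes, so its numeric value only has meaning relative to the character set it came from.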

A grapheme consists of one or more code points. Is "provides code points for all graphemes" really what is intended here? I assume not, since you can't represent every possible combining sequence of Unicode characters (KATAKANA LETTER KA followed by COMBINING GRAVE ACCENT, say) in a single 32-bit code point.
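
To make the point concrete, here is a small sketch using Python's standard unicodedata module (nothing Parrot-specific; the helper name nfc_code_points is my own). It shows that the KATAKANA LETTER KA example stays two code points even after NFC composition, because Unicode defines no precomposed character for it, whereas "e" plus the same accent does compose to one.

```python
import unicodedata


def nfc_code_points(text: str) -> list[str]:
    """Return the character names of `text` after NFC normalization.

    Hypothetical helper for illustration only.
    """
    composed = unicodedata.normalize("NFC", text)
    return [unicodedata.name(ch) for ch in composed]


# Base character followed by a combining mark: one grapheme as the user
# sees it, but two code points with no single-code-point equivalent.
ka_grave = nfc_code_points("\u30AB\u0300")

# For comparison, "e" + COMBINING GRAVE ACCENT does have a precomposed
# form, so NFC collapses it to a single code point.
e_grave = nfc_code_points("e\u0300")
```

So any scheme that promises one integer per grapheme has to either restrict itself to graphemes with precomposed forms or assign values outside Unicode.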


                      - Damien


