Parrot provides code points for all graphemes, even for those character sets/encodings which don't inherently do so. Most sets that have variable-length encodings use an escape sequence scheme: the value of the first byte in a character determines whether the grapheme is a single-byte or a multi-byte sequence. When Parrot turns these into code points, it builds up the final value one byte at a time. The first byte goes in the low 8 bits of the integer. If there is a second byte in the sequence, the current value is shifted left 8 bits and the new byte is placed in the low 8 bits. If there is a third byte, everything is shifted left another 8 bits and that byte is placed in the low 8 bits, and so on.
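The shift-and-fill scheme described above can be sketched roughly as follows. This is only an illustration in Python, not Parrot's actual code: the function name pack_bytes_to_code_point is made up for this example, and the input bytes are arbitrary values rather than characters from any particular character set.

```python
def pack_bytes_to_code_point(seq: bytes) -> int:
    """Fold a byte sequence into one integer, as the paragraph describes.

    Hypothetical helper for illustration; not part of Parrot.
    """
    value = 0
    for byte in seq:
        # Shift what we have so far left by 8 bits, then put the
        # new byte in the low 8 bits.
        value = (value << 8) | byte
    return value


# A single byte is its own value; later bytes push earlier ones upward.
one_byte = pack_bytes_to_code_point(bytes([0x41]))                 # 0x41
two_bytes = pack_bytes_to_code_point(bytes([0x82, 0xA0]))          # 0x82A0
three_bytes = pack_bytes_to_code_point(bytes([0x8F, 0xA1, 0xC1]))  # 0x8FA1C1
```

Note that the first byte of the sequence ends up in the most significant position, so a 32-bit integer has room for at most four bytes under this scheme.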
A grapheme consists of one or more code points. Is "provides code points for all graphemes" really what is intended here? I assume not, since you can't represent every combination of combining Unicode characters (COMBINING GRAVE ACCENT + KATAKANA LETTER KA, say) in a single 32-bit code point.
- Damien
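To make the objection concrete, here is a small sketch using Python's standard unicodedata module (Python is used only for illustration and has nothing to do with Parrot's internals; the helper name nfc_code_points is made up). It shows that some base-plus-combining sequences collapse to one code point under NFC normalization, while KATAKANA LETTER KA followed by COMBINING GRAVE ACCENT does not, because Unicode has no precomposed character for it.

```python
import unicodedata


def nfc_code_points(text: str) -> list:
    """Return the code points of text after NFC normalization.

    Hypothetical helper for illustration only.
    """
    return [ord(ch) for ch in unicodedata.normalize("NFC", text)]


# LATIN SMALL LETTER E + COMBINING GRAVE ACCENT has a precomposed form.
e_grave = nfc_code_points("e\u0300")        # [0xE8]

# KATAKANA LETTER KA + COMBINING GRAVE ACCENT has none, so the
# grapheme still needs two code points after normalization.
ka_grave = nfc_code_points("\u30AB\u0300")  # [0x30AB, 0x0300]
```

So a single grapheme can remain a sequence of several code points, which is the case the quoted sentence seems to overlook.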