Author: simon
Date: Wed Jan 23 05:05:28 2008
New Revision: 25172

Modified:
   trunk/docs/pdds/draft/pdd28_character_sets.pod

Log:
[docs][pdds] A little bit more. Implementation section next week.

Modified: trunk/docs/pdds/draft/pdd28_character_sets.pod
==============================================================================
--- trunk/docs/pdds/draft/pdd28_character_sets.pod (original)
+++ trunk/docs/pdds/draft/pdd28_character_sets.pod Wed Jan 23 05:05:28 2008
@@ -72,10 +72,86 @@
 Dealing with variable-byte encodings is not fun; for instance, you need to
 do a bunch of computations every time you traverse a string. In order to
 make programming a lot easier, we define a Parrot native string
-format to be an array of unsigned 32-bit Unicode codepoints.
+format to be an array of unsigned 32-bit Unicode codepoints. This is
+equivalent to UCS-4 except for the normalization form semantics
+described below.
+
+This means that B<if> you've done the necessary checks, and hence you
+know you're dealing with a Parrot native string, then you can continue to
+program in the usual C idioms - for the most part. Of course you'll need
+to be careful with your comparisons, since what you'll be getting back
+will be a C<Parrot_UInt4> instead of a C<char>.
 
 =head2 Grapheme normalization form
 
+Unicode characters can be expressed in a number of different ways
+according to the Unicode Standard. This is partly to do with maintaining
+compatibility with existing character encodings. For instance, in
+Serbo-Croatian and Slovenian, there's a letter which looks like an C<i>
+without the dot but with two grave (C<`>) accents. If you have an
+especially good POD renderer, you can see it here: E<0x209>.
+
+There are two ways you can represent this in Unicode. You can use
+character 0x209, also known as C<LATIN SMALL LETTER I WITH DOUBLE GRAVE>,
+which does the job all in one go. This is called a "composed" character,
+as opposed to its equivalent decomposed sequence:
+C<LATIN SMALL LETTER I> (0x69) followed by C<COMBINING DOUBLE GRAVE ACCENT>
+(0x30F).
+
+Unicode standardises a number of "normalization forms" that specify which
+representation you should use. We're using an extension of Normalization
+Form C, which basically says: decompose everything, then re-compose as
+much as you can. So if you see the integer stream C<0x69 0x30F>, it
+needs to be replaced by C<0x209>. This means that Parrot string data
+structures need to keep track of what normalization form a given string
+is in, and Parrot must provide functions to convert between
+normalization forms.
+
+Now, Serbo-Croatian is sometimes also written with Cyrillic letters rather
+than Latin letters. The Cyrillic equivalent of the above character has
+no precomposed form in Unicode, but would be specified as a decomposed pair
+C<CYRILLIC SMALL LETTER I> (0x438) C<COMBINING DOUBLE GRAVE ACCENT>
+(0x30F). (This PDD does not require Parrot to convert strings between
+differing political sensibilities.) However, it is still visible as one
+character and, despite being expressed even in NFC as two characters,
+is still a single character as far as a human reader is concerned.
+
+Hence we introduce the distinction between a "character" and a
+"grapheme". This is a Parrot distinction - it does not exist in the
+Unicode Standard.
+
+When Parrot target languages' regular expression engines wish to match
+a grapheme, then NFC is clearly not normalized enough. This is why we
+have defined a further normalization stage, NFG - Normalization Form
+for Graphemes.
+
+NFG uses out-of-band signalling in the string to refer the conforming
+implementation to a decomposition table. UCS-4 specifies an encoding for
+Unicode codepoints from 0 to 0x7FFFFFFF. In other words, any codepoints
+with the high bit set are undefined. We define these out-of-band
+codepoints as indexes into a lookup table, which maps between a
+temporary ID and its associated decomposition.
+
+In practice, this goes as follows: Assuming our Russified Serbo-Croatian
+string is the first string that Parrot sees, when it is converted to
+Parrot's default format, it would be normalized to a single character
+having the codepoint C<0x80000000>. At the same time, Parrot would
+insert an entry into a temporary array at array index 0, consisting of
+the codepoint sequence C<0x00000438 0x0000030F> - that is, the Unicode
+decomposition of the grapheme.
+
+This has one big advantage: applications which don't care about
+graphemes can just pass the codepoint around as if it's any other number
+- uh, character. Only applications which care about the specific
+properties of Unicode characters need to pay the overhead of peeking
+inside the array and reading the decomposition.
+
+Individual languages may need to think carefully about their concept of,
+for instance, "the length of a string" to determine whether or not they
+need to visit the lookup table for these strings. At any rate,
+Parrot should provide both grapheme-aware and character-aware iterators
+for string traversal.
+
 =head1 IMPLEMENTATION
 
 =head2 Changes required to current string implementation
@@ -86,9 +162,19 @@
 
 =head2 String encoding API
 
+=head2 String programming checklist
+
 =head1 REFERENCES
 
-List of references.
+http://plan9.bell-labs.com/sys/doc/utf.html - Plan 9's Runes are not
+dissimilar to Parrot's integer codepoints, and this is a good
+introduction to the Unicode world.
+
+http://www.unicode.org/reports/tr15/ - The Unicode Consortium's
+explanation of different normalization forms.
+
+"Unicode: A Primer", Tony Graham - Arguably the most readable book on
+how Unicode works.
 
 =cut
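
The NFG lookup table described in the patch above could look roughly like the following C sketch. This is only an illustration under stated assumptions: the names C<nfg_table>, C<nfg_intern>, C<nfg_lookup> and C<nfg_codepoint_length>, the fixed table sizes, and the use of C<uint32_t> in place of C<Parrot_UInt4> are all hypothetical and are not defined by the PDD, whose implementation section is still to be written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: none of these names or limits come from the PDD. */
#define NFG_FLAG        0x80000000u  /* high bit marks an out-of-band grapheme ID */
#define NFG_MAX_ENTRIES 64           /* assumed fixed table size, for brevity */
#define NFG_MAX_LEN     4            /* assumed maximum decomposition length */

typedef struct {
    uint32_t cp[NFG_MAX_LEN];        /* decomposition, e.g. 0x438 0x30F */
    size_t   len;
} nfg_entry;

typedef struct {
    nfg_entry entries[NFG_MAX_ENTRIES];
    size_t    count;
} nfg_table;

/* True if c is an out-of-band grapheme ID rather than a plain codepoint. */
static int nfg_is_grapheme(uint32_t c)
{
    return (c & NFG_FLAG) != 0;
}

/* Return the out-of-band codepoint for a decomposed sequence, adding a
 * table entry if the sequence has not been seen before. Returns 0 (which
 * never has the high bit set) if the sequence is empty, too long, or the
 * table is full. */
static uint32_t nfg_intern(nfg_table *t, const uint32_t *seq, size_t len)
{
    size_t i, j;

    if (len == 0 || len > NFG_MAX_LEN)
        return 0;
    for (i = 0; i < t->count; i++) {
        if (t->entries[i].len != len)
            continue;
        for (j = 0; j < len && t->entries[i].cp[j] == seq[j]; j++)
            ;
        if (j == len)
            return NFG_FLAG | (uint32_t)i;   /* already in the table */
    }
    if (t->count == NFG_MAX_ENTRIES)
        return 0;
    for (j = 0; j < len; j++)
        t->entries[t->count].cp[j] = seq[j];
    t->entries[t->count].len = len;
    return NFG_FLAG | (uint32_t)t->count++;
}

/* Fetch the decomposition behind an out-of-band codepoint. */
static const nfg_entry *nfg_lookup(const nfg_table *t, uint32_t c)
{
    size_t idx = (size_t)(c & ~NFG_FLAG);

    assert(nfg_is_grapheme(c));
    assert(idx < t->count);
    return &t->entries[idx];
}

/* Character-aware length: graphemes count as their full decomposition.
 * The grapheme-aware length is simply n. */
static size_t nfg_codepoint_length(const nfg_table *t,
                                   const uint32_t *str, size_t n)
{
    size_t i, total = 0;

    for (i = 0; i < n; i++)
        total += nfg_is_grapheme(str[i]) ? nfg_lookup(t, str[i])->len : 1;
    return total;
}
```

With an empty table, interning the pair 0x438 0x30F yields 0x80000000 and stores the pair at index 0, matching the walk-through in the patch. Code that does not care about graphemes never needs to call C<nfg_lookup>; it can compare and copy the 32-bit values directly.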