Author: allison Date: Tue Apr 1 02:33:23 2008 New Revision: 26689 Modified: trunk/docs/pdds/draft/pdd28_character_sets.pod
Log: [pdd] Completed the core architectural component of the Strings PDD. Still adding API sections. Modified: trunk/docs/pdds/draft/pdd28_character_sets.pod ============================================================================== --- trunk/docs/pdds/draft/pdd28_character_sets.pod (original) +++ trunk/docs/pdds/draft/pdd28_character_sets.pod Tue Apr 1 02:33:23 2008 @@ -1,13 +1,13 @@ -# Copyright (C) 2001-2005, The Perl Foundation. +# Copyright (C) 2008, The Perl Foundation. # $Id$ =head1 NAME -docs/pdds/pdd28_character_sets.pod - Strings and character sets +docs/pdds/pdd28_strings.pod - Parrot Strings =head1 ABSTRACT -This PDD describes the conventions expected for users of Parrot strings, +This PDD describes the conventions for strings in Parrot, including but not limited to support for multiple character sets, encodings and languages. @@ -15,51 +15,16 @@ $Revision$ -=head1 DESCRIPTION - -Here is a summary of the design decisions described in this PDD. - -=over 3 - -=item * - -Parrot supports multiple string formats, and so users of Parrot strings -must be aware at all times of string encoding issues and how these -relate to the string interface. - -=item * - -The native Parrot string format is an array of 32-bit Unicode codepoints -in B<grapheme normalization form>. (NFG) - -=item * - -NFG is defined as a normalization which allocates at most one codepoint -to each visible character. - -=item * - -An interface is defined for interacting with Parrot strings and converting -between character sets and encodings. - -=back - -=head2 Definitions - -Here is a brief definition of some of the technical terms in this PDD. -For more detailed description, see any of the links at the bottom of -this PDD. - -=over 3 +=head1 DEFINITIONS -=item character +=head2 Character -A character is "the abstract description of a symbol". It's the smallest -chunk of text a computer knows how to deal with. Of course internally to +A character is the abstract description of a symbol. It's the smallest +chunk of text a computer knows how to deal with. Internally to the computer, a character (just like everything else) is a number, so -you need to define... +a few further definitions are needed. -=item character set +=head2 Character Set A character set is officially a deprecated term, with Unicode preferring the concepts of I<character repertoire> (a collection of characters) and @@ -67,28 +32,12 @@ which character in the repertoire). We still use it, though, to mean the standard which defines both a repertoire and a code. -=item codepoint +=head2 Codepoint A codepoint is the numeric representation of a character according to a given character set. So in ASCII, the character C<A> has codepoint 0x41. -=item combining character - -A combining character is a Unicode concept. It is a character which -modifies the preceding character. For instance, accents, lines, circles, -boxes, etc. which are not to be displayed on their own, but to be -"composed" with the preceding character. - -=item grapheme - -A grapheme is our concept. It is a character followed by all of its -combining characters; on other words, one or more characters forming a -visible whole when displayed. Parrot must support languages which -manipulate strings grapheme-by-grapheme, and since graphemes are the -highest-level interpretation of a "character", they're very useful for -converting between character sets. - -=item encoding +=head2 Encoding An encoding determines how a codepoint is represented inside a computer. Simple encodings like ASCII define that the codepoints 0-127 simply @@ -97,278 +46,250 @@ codepoints. Variable-width encodings like UTF-8 use one byte for codepoints 0-127, two bytes for codepoints 127-2047, and so on. -=back - -=head2 Encoding awareness - -Parrot was designed from the outset to support multiple string formats: -multiple character sets and multiple encodings. Unlike other such -projects, we don't standardize on Unicode internally. This is because -for the majority of use cases, it's still far more efficient to deal -with whatever input data the user sends us, which, equally in the -majority of use cases, is something like ASCII - or at least, some kind -of byte-based rather than character-based encoding. - -So internally, consumers of Parrot strings have to be aware that there -is a plurality of string encodings going on inside Parrot. (Producers of -Parrot strings can do whatever is most efficient for them.) The -implications of this for the internal API will be detailed in the -implementation section below, but to put it in simple terms: if you find -yourself writing C<*s++> or any other C string idioms, you need to stop -and think if that's what you really mean. Not everything is byte-based -any more. - -However, we're going to try to make it as easy for C<*s++>-minded people -as possible, and part of that is the declaration of a Parrot native -string format. You don't have to use it, but if you do all your dreams -will come true. - -=head2 Native string format - -Dealing with variable-byte encodings is not fun; for instance, you need -to do a bunch of computations every time you traverse a string. In order -to make programming a lot easier, we define a Parrot native string -format to be an array of unsigned 32-bit Unicode codepoints. This is -equivalent to UCS-4 except for the normalization form semantics -described below. - -This means that B<if> you've done the necessary checks, and hence you -know you're dealing with a Parrot native string, then you can continue to -program in the usual C idioms - for the most part. Of course you'll need -to be careful with your comparisons, since what you'll be getting back -will be a C<Parrot_Rune> instead of a C<char>. - -=head2 Grapheme normalization form - -Unicode characters can be expressed in a number of different ways -according to the Unicode Standard. This is partly to do with maintaining -compatibility with existing character encodings. For instance, in -Serbo-Croatian and Slovenian, there's a letter which looks like an C<i> -without the dot but with two grave (C<`>) accents. If you have an -especially good POD renderer, you can see it here: E<0x209>. - -There are two ways you can represent this in Unicode. You can use -character 0x209, also known as C<LATIN SMALL LETTER I WITH DOUBLE GRAVE>, -which does the job all in one go. This is called a "composed" character, -as opposed to its equivalent decomposed sequence: -C<LATIN SMALL LETTER I> (0x69) followed by C<COMBINING DOUBLE GRAVE ACCENT> -(0x30F). - -Unicode standardises in a number of "normalization forms" which -repesentation you should use. We're using an extension of Normalization -Form C, which says basically, decompose everything, then re-compose as -much as you can. So if you see the integer stream C<0x69 0x30F>, it -needs to be replaced by C<0x209>. This means that Parrot string data -structures need to keep track of what normalization form a given string -is in, and Parrot must provide functions to convert between -normalization forms. - -Now, Serbo-Croat is sometimes also written with Cyrillic letters rather -than Latin letters. The Cyrillic equivalent of the above character is -not part of Unicode, but would be specified as a decomposed pair -C<CYRILLIC SMALL LETTER I> (0x438) C<COMBINING DOUBLE GRAVE ACCENT> -(0x30F). (This PDD does not require Parrot to convert strings between -differing political sensibilities.) However, it is still visible as one -character and despite being expressed even in NFC as two characters, -is still a single character as far as a human reader is concerned. - -Hence we introduce the distinction between a "character" and a -"grapheme". This is a Parrot distinction - it does not exist in the -Unicode Standard. - -When a regular expression engine from one of Parrot's target languages -wishes to match a grapheme, then NFC is clearly not normalized enough. -This is why we have defined a further normalization stage, NFG - -Normalization Form for Graphemes. - -NFG uses out-of-band signalling in the string to refer the conforming -implementation to a decomposition table. UCS-4 specifies an encoding for -Unicode codepoints from 0 to 0x7FFFFFFF. In other words, any codepoints -with the first bit set are undefined. We define these out-of-band -codepoints as indexes into a lookup table, which maps between a -temporary ID and its associated decomposition. - -In practice, this goes as follows: Assuming our Russified Serbo-Croat string is -the first string that Parrot sees, when it is converted to Parrot's default -format, it would be normalized to a single character having the codepoint -C<0xFFFFFFFFF>. (In other words, -1; grapheme table entries count backwards to -allow Parrot to check if a character is a grapheme using a single -sign-comparison operation) At the same time, Parrot would insert an entry into -the global grapheme table, C<Parrot_grapheme_table>, at array index 0, -consisting of the bytestream C<0x00000438 0x000000030F> - that is, the Unicode -decomposition of the grapheme. - -This has one big advantage: applications which don't care about -graphemes can just pass the codepoint around as if it's any other number -- uh, character. Only applications which care about the specific -properties of Unicode characters need to take the overload of peeking -inside the array and reading the decomposition. - -Individual languages may need to think carefully about their concept of, -for instance, "the length of a string" to determine whether or not they -need to visit the lookup table for these strings. At any rate, -Parrot should provide both grapheme-aware and codepoint-aware iterators -for string traversal. - -=head2 The Parrot internal character type - -The other advantage of NFG is that it gives us a single value that can -unambiguously be passed between any charset/encoding pair: any variable -typed C<Parrot_Rune> is defined to be a C<Parrot_UInt4> Unicode -codepoint where values >= 0x80000000 are understood to be entries into the -global C<Parrot_grapheme_table> array. - -Strings in Parrot's native string format will I<probably> be an array of -C<Parrot_Rune>s. (It is possible that native strings may want to carry -around their own grapheme tables rather than use the global one; in -which case, their codepoints are better off called C<Parrot_UInt4>s, to -reserve the interpretation of the C<Parrot_Rune> type. But let's burn -that bridge when we come to it.) - -Because C<Parrot_Rune>s are a single unambiguous representation of a -character at the highest level Parrot will be required to support - that -is, in principle, any character from any character set can be -represented by at most one C<Parrot_Rune> - they are perfect as an -intermediate character representation for converting between string -types. - -=head2 Some useful statistics for optimizers - -I took a large sample of highly mixed Unicode data, generated by taking the -contents of English Wikipedia and removing 99% of the pure-ASCII -content, and looked at the distribution of UTF-8 bytes per character. - - 1 bytes: 116324899 - 2 bytes: 19229506 - 3 bytes: 12198853 - 4 bytes: 4170 - -There is a possible methodological problem here in that English -Wikipedia is more likely to contain content from languages "close" to -English and transliterations into Latin letters using accented forms -rather than "exotic" scripts. Nevertheless, as a first approximation it -appears that a majority of the world's data still fits in one byte of -UTF-8, but once you pass one byte, two or three UTF-8 bytes are (very -roughly speaking) equally likely. - -=head1 IMPLEMENTATION - -=head2 Division of labour between "charset" and "encoding" - Character sets and encodings are related but separate concepts. An encoding is the lower-level representation of a string's data, whereas the character set determines higher-level semantics. Typically, character set functions will ask a string's encoding functions to retrieve data from the string, and then process the retrieved data. -In (rare) cases where a character set has a particularly strong link -to an encoding, (ISO8859-X with fixed 8-bit being the prime example) -then the character set functions may, for optimization purposes, contain -code which bypasses the encoding functions and handles the string data -directly. However, an encoding check B<MUST> still be made and -equivalent code to do things "the slow way" must be included if the -check fails. +=head2 Combining Character -=head2 The global grapheme table +A combining character is a Unicode concept. It is a character which +modifies the preceding character. For instance, accents, lines, circles, +boxes, etc. which are not to be displayed on their own, but to be +composed with the preceding character. -When constructing strings in NFG, graphemes not expressible as a single -character in Unicode are represented by a grapheme ID which looks up -into the global grapheme table. When Parrot comes across a grapheme, it -must first determine whether or not the grapheme already has an entry -in the grapheme table. Therefore the table cannot strictly be an array, -as that would make lookup inefficient. The grapheme table is -represented, then, as both an array and a hash structure. The array -gives forward-lookup and the hash allows reverse lookup. Converting a -string of characters into a grapheme ID can be demonstrated with the -following Perl 5 pseudocode, for the grapheme C<0x438 0x30F>: +=head2 Grapheme - $codepoint = ($grapheme_lookup->{0x438}{0x30F} ||= do { - push @grapheme_table, "\x{438}\x{30F}"; - ~ $#grapheme_table; - }); - push @string, $codepoint; +In linguistics, a grapheme is a single symbol in a writing system (letter, +number, punctuation mark, kanji, hiragana, Arabic glyph, Devanagari symbol, +etc), including any modifiers (diacritics, etc). + +We've adopted the term grapheme to refer to one or more characters forming a +visible whole when displayed, in other words, a bundle of a character and all +of its combining characters. Parrot must support languages which manipulate +strings grapheme-by-grapheme, and since graphemes are the highest-level +interpretation of a "character", they're useful for converting between +character sets. + +=head2 Normalization Form + +A normalization form standardizes the representation of a string by +transforming a sequence of combining characters into a more complex character +(composition), or by transforming a complex character into a sequence of +composing characters (decomposition). The decomposition forms also define a +standard order for the composing characters, to allow string comparisons. The +Unicode Standard defines four normalization forms: NFC and NFKC are +composition, NFD and NFKD are decomposition. See L<Unicode Normalization +Forms|http://www.unicode.org/reports/tr15/> for more details. -=head2 Implications for string implementation +=head2 Grapheme Normalization Form -To provide an area for encoding-specific information, such as the -current normalisation form, the C<parrot_string_representation_t> is to -be re-worked as a void pointer to an encoding-specific representation -structure. For instance, one such structure for the Parrot native -encoding may look like this: - - typedef Parrot_UInt4 Parrot_Rune; - - typedef struct parrot_native_representation_t { - normalization_form_t normalization, - ... - }; - -=head2 String access API - -It is hard to fully lay out a list of what string functions will be -required as this will obviously be determined by use. This PDD assumes -for the moment that the current string functions will on the whole be -maintained, but will be adapted to use the ideas described here - -particularly the use of C<Parrot_Rune> as an intermediary character -representation, the addition of normalization form structures and the -C<Parrot_grapheme_table>. - -=head3 Conversions between normalization form, encoding and charset - -Given the existence of a single unambiguous character type, there is no -need for a large and complicated set of string conversion functions. All -conversion will be done with a function called C<string_slow_copy>: - - INTVAL string_slow_copy(STRING* src, STRING* dst) - -To convert a string from one format to another, simply create a new -empty string with the required attributes, and pass the source string -and this new string to C<string_slow_copy>. This function creates two -string iterators and calls C<get_and_advance> on the source string's -iterator to read each character in turn into a C<Parrot_Rune>, and then -calls C<set_and_advance> on the destination string's iterator to append -the character. This will intrinsically perform a conversion using -C<Parrot_Rune>s as the intermediary. - -Note that C<encoding_get_codepoint> on strings which are not in NFG may -need to read ahead multiple characters in order to turn them into a -single grapheme, in order to return a C<Parrot_Rune>. +Grapheme normalization form (NFG) is a normalization which allocates exactly +one codepoint to each grapheme. -=head2 String programming checklist +=head1 DESCRIPTION =over 3 =item * -Are you manipulating the string data directly? It's OK to do this if you -know that the string is in Parrot's native format and you're using -C<Parrot_Rune>s. It's generally not OK to do this otherwise. +Parrot supports multiple string formats, and so users of Parrot strings must be +aware at all times of string encoding issues and how these relate to the string +interface. + +=item * + +Parrot provides an interface for interacting with strings and converting +between character sets and encodings. =item * -Related to the above, if you see C<*s++> in code you've just written, -then warning bells should be sounding. +Operations that require understanding the semantics of a string must respect +the character set (character repertoire and character code) of the string. =item * -C<Parrot_Rune>s are guaranteed to be Unicode codepoints up to -C<0x7FFFFFFF> and Parrot graphemes above that. Nothing else is -guaranteed to be in Unicode; operations which require understanding the -semantics of the string must go through the character set structure. +Operations that require understanding the layout of the string must respect the +encoding of the string. =item * -Similarly, operations which required understanding of the layout of the -string must go through the encoding structure. +In addition to common string formats, Parrot provides an additional string +format that is a sequence of 32-bit Unicode codepoints in NFG. =back +=head1 IMPLEMENTATION + +Parrot was designed from the outset to support multiple string formats: +multiple character sets and multiple encodings. We don't standardize on Unicode +internally, because for the majority of use cases, it's still far more +efficient to deal with whatever input data the user sends us. + +Consumers of Parrot strings need to be aware that there is a plurality of +string encodings inside Parrot. (Producers of Parrot strings can do whatever is +most efficient for them.) To put it in simple terms: if you find yourself +writing C<*s++> or any other C string idioms, you need to stop and think if +that's what you really mean. Not everything is byte-based any more. + +=head2 Grapheme Normalization Form + +Unicode characters can be expressed in a number of different ways according to +the Unicode Standard. This is partly to do with maintaining compatibility with +existing character encodings. For instance, in Serbo-Croatian and Slovenian, +there's a letter which looks like an C<i> without the dot but with two grave +(C<`>) accents (E<0x209>). Unicode can represent this letter as a composed +character C<0x209>, also known as C<LATIN SMALL LETTER I WITH DOUBLE GRAVE>, +which does the job all in one go. It can also represent this letter as a +decomposed sequence: C<LATIN SMALL LETTER I> (C<0x69>) followed by C<COMBINING +DOUBLE GRAVE ACCENT> (C<0x30F>). We use the term I<grapheme> to refer to a +"letter" whether it's represented by a single codepoint or multiple codepoints. + +String operations on this kind of variable-byte encoding can be complex and +expensive. Operations like comparison and traversal require a series of +computations and lookaheads, since any given grapheme may be a sequence of +combining characters. The Unicode Standard defines several "normalization +forms" that help with this problem. Normalization Form C (NFC), for example, +decomposes everything, then re-composes as much as possible. So if you see the +integer stream C<0x69 0x30F>, it needs to be replaced by C<0x209>. However, +Unicode's normalization forms don't go quite far enough to completely solve the +problem. For example, Serbo-Croat is sometimes also written with Cyrillic +letters rather than Latin letters. Unicode doesn't have a single composed +character for the Cyrillic equivalent of the Serbo-Croat C<LATIN SMALL LETTER I +WITH DOUBLE GRAVE>, so it is represented as a decomposed pair C<CYRILLIC SMALL +LETTER I> (C<0x438>) with C<COMBINING DOUBLE GRAVE ACCENT> (C<0x30F>). This +means that even in the most normalized Unicode form, string manipulation code +must always assume a variable-byte encoding, and use expensive lookaheads. The +cost is incurred on every operation, though the particular string operated on +might not contain combining characters. It's particularly noticable in parsing +and regular expression matches, where backtracking operations may retraverse +the characters of a simple string hundreds of times. + +In order to reduce the cost of variable-byte operations and simplify some +string manipulation tasks, Parrot defines an additional normalization: +Normalization Form G (NFG). In NFG, every grapheme is guaranteed to be +represented by a single codepoint. Graphemes that don't have a single codepoint +representation in Unicode are given a dynamically generated codepoint unique to +the NFG string. + +An NFG string is a sequence of signed 32-bit Unicode codepoints. It's +equivalent to UCS-4 except for the normalization form semantics. UCS-4 +specifies an encoding for Unicode codepoints from 0 to 0x7FFFFFFF. In other +words, any codepoints with the first bit set are undefined. NFG interprets the +unused bit as a sign bit, and reserves all negative codepoints as dynamic +codepoints. A negative codepoint acts as an index into a lookup table, which +maps between a dynamic codepoint and its associated decomposition. + +In practice, this goes as follows: When our Russified Serbo-Croat string is +converted to NFG, it is normalized to a single character having the codepoint +C<0xFFFFFFFFF> (in other words, -1 in 2's complement). At the same time, Parrot +inserts an entry into the string's grapheme table at array index -1, containing +the Unicode decomposition of the grapheme C<0x00000438 0x000000030F>. + +Parrot will provide both grapheme-aware and codepoint-aware string operations, +such as iterators for string traversal and calculations of string length. +Individual language implementations can choose between the two types of +operations depending on whether their string semantics are character-based or +codepoint-based. For languages that don't currently have Unicode support, the +grapheme operations will allow them to safely manipulate Unicode data without +changing their string semantics. + +=head3 Advantages + +Applications that don't care about graphemes can handle a NFG codepoint in a +string as if it's any other character. Only applications that care about the +specific properties of Unicode characters need to take the load of peeking +inside the grapheme table and reading the decomposition. + +Using negative numbers for dynamic codepoints allows Parrot to check if a +particular codepoint is dynamic using a single sign-comparison operation. It +also means that NFG can be used without conflict on encodings from 7-bit +(signed 8-bit integers) to 63-bit (using signed 64-bit integers) and beyond. + +Because any grapheme from any character set can be represented by a single NFG +codepoint, NFG strings are useful as an intermediate representation for +converting between string types. + +=head3 Disadvantages + +A 32-bit encoding is quite large, considering the fact that the Unicode +codespace only requires up to C<0x10FFFF>. The Unicode Consortium's FAQ notes +that most Unicode interfaces use UTF-16 instead of UTF-32, out of memory +considerations. This means that although Parrot will use 32-bit NFG strings for +optimizations within operations, for the most part individual users should use +the native character set and encoding of their data, rather than using NFG +strings directly. + +The conceptual cost of adding a normalization form beyond those defined in the +Unicode Standard has to be considered. However, to fully support Unicode, +Parrot already needs to keep track of what normalization form a given string is +in, and provide functions to convert between normalization forms. The +conceptual cost of one additional normalization form is relatively small. + +=head3 The grapheme table + +When constructing strings in NFG, graphemes not expressible as a single +character in Unicode are represented by a dynamic codepoint index into the +string's grapheme table. When Parrot comes across a multi-codepoint grapheme, +it must first determine whether or not the grapheme already has an entry in the +grapheme table. Therefore the table cannot strictly be an array, as that would +make lookup inefficient. The grapheme table is represented, then, as both an +array and a hash structure. The array interface provides forward-lookup and the +hash interface reverse lookup. Converting a multi-codepoint grapheme into a +dynamic codepoint can be demonstrated with the following Perl 5 pseudocode, for +the grapheme C<0x438 0x30F>: + + $codepoint = ($grapheme_lookup->{0x438}{0x30F} ||= do { + push @grapheme_table, "\x{438}\x{30F}"; + ~ $#grapheme_table; + }); + push @string, $codepoint; + +=head2 String API + +Strings have the following structure: + + struct parrot_string_t { + UnionVal cache; + Parrot_UInt flags; + char *strstart; + UINTVAL bufused; + UINTVAL strlen; + const struct _encoding *encoding; + const struct _charset *charset; + const struct _normalization *normalization; + UINTVAL hashval; + }; + +Deprecation note: the enum C<parrot_string_representation_t> will be removed. + +The current string functions will on the whole be maintained, with some +modifications for the addition of the NFG string format. + +=head3 Conversions between normalization form, encoding, and charset + +Conversion will be done with a function called C<string_grapheme_copy>: + + INTVAL string_grapheme_copy(STRING* src, STRING* dst) + +Converting a string from one format to another involves creating a new empty +string with the required attributes, and passing the source string and the new +string to C<string_grapheme_copy>. This function iterates through the source +string one grapheme at a time, using the character set function pointer +C<get_grapheme> (which may read ahead multiple characters with strings that +aren't in NFG). For each source grapheme, the function will call +C<set_grapheme> on the destination string (which may append multiple characters +in non-NFG strings). This conversion effectively uses an intermediate NFG +representation. + +=head2 String PMC API + =head1 REFERENCES -http://plan9.bell-labs.com/sys/doc/utf.html - Plan 9's Runes are not -dissimilar to Parrot's Runes, and this is a good introduction to the -Unicode world. (Also available at -http://sirviente.9grid.es/sources/plan9/sys/doc/utf.ps ) +http://sirviente.9grid.es/sources/plan9/sys/doc/utf.ps - Plan 9's Runes are not +dissimilar to NFG strings, and this is a good introduction to the Unicode +world. http://www.unicode.org/reports/tr15/ - The Unicode Consortium's explanation of different normalization forms.