Dan suggested that I make this document public. I've been reluctant to do so, for several reasons, the most important being that I'm far, FAR, behind in the current Parrot developments.
Therefore, this document will probably rudely stomp right over any recent string, encoding and regular expression work that has been going on. Its idea of vtables is just so much handwaving: I don't really know how vtables work or how they are currently implemented in Parrot. However, I have spent some serious thought on this, based on the appropriate amount of fun I've had lately with Unicode in the upcoming Perl 5.8, and some familiarity with the Unicode standard. The most important message is this: give up on 8-bit bytes, already. Time to move on, chop chop.

The document itself is a collection (what we experts technically call "a mess") of theory and practice: a character model and statements about how things should be, and how things should probably be implemented.

I've asked Larry whether the ICU license would be compatible with whatever license Parrot will have, but he's currently cruising...

The bad (or good?) news is that I don't have the time to work on this design document much more, to maintain or develop it, and certainly not to implement any of it (another reason why I have been reluctant to publish this). But that's how it is.

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
=head1 TITLE

Parrot String Handling

=head1 VERSION

$Id: parrotstring.pod,v 1.16 2002/01/18 03:57:07 jhi Exp $

=head2 Current

    Maintainer: Some Body
    Class: Internals
    PDD Number: X
    Version: 1.0
    Status: Developing
    Last Modified:
    PDD Format: 1
    Language: English

=head2 History

None. First version. (Jarkko Hietaniemi)

=head1 CHANGES

None. First version.

=head1 ABSTRACT

This PDD specifies how Parrot should handle characters and text, the character encoding model, and the basic operations.

=head1 DESCRIPTION

Since discussion of character sets, encodings, and the like quite often confuses people considerably, let's first define some terms for our character encoding model. The terminology follows the Unicode Technical Report #17, I<Character Encoding Model>, http://www.unicode.org/unicode/reports/tr17/ . The character encoding model consists of five layers:

=head2 Terminology

=over 4

=item I<ACR>

I<Abstract Character Repertoire> - the (unordered) set of characters to be encoded. Unicode defines one (open-ended) repertoire; the ISO 8859 standards are examples of (closed) repertoires.

=item I<CCS>

I<Coded Character Set> - a mapping from an abstract character repertoire to a set of non-negative integers known as I<code points>. This is how the Unicode C<LATIN CAPITAL LETTER A> is mapped to U+0041. Alternative names for a CCS are a I<character encoding>, a I<coded character repertoire>, a I<character set definition>, or a I<code page>. Examples include Unicode, ISO 8859, the Windows code pages, and the JIS X 0208. The code point space (from zero to some positive number) can be flat or complex (some ranges may be unused or forbidden).

=item I<CEF>

I<Character Encoding Form> - a mapping from a set of non-negative integers (a CCS) to a set of sequences of particular I<code units> of some specified (computer-architecture imposed) width, such as bytes. CEFs can be fixed-width or variable-width. Note that (especially historically) in many cases the CCS and the CEF have used the same (often byte-sized) fixed width, so they have been practically identical. But don't let that fool you. A byte sequence can be either I<assigned> (it represents a valid and assigned character in the code point space), I<unassigned> (it represents a valid but unused code point), or I<illegal> (for some reason forbidden). Examples of fixed-width CEFs include 7-bit, 8-bit, 16-bit, and 32-bit. Note that these are independent of any particular CCS: an 8-bit CEF can be used for encoding several different CCSs, such as the ISO 8859 family and the different EBCDICs. Examples of variable-width CEFs include UTF-8 and UTF-16.

=item I<CES>

I<Character Encoding Scheme> - a mapping from a set of sequences of code units (from one or more CEFs) to a serialised sequence of bytes. Serialisation is important for transmission and storage. It's important not to confuse a CEF and a CES: a CEF maps code points to code units (numbers), but a CES maps code units to serialised sequences of bytes. Therefore a CES must take byte ordering into account. A CES may also include more than one CCS, and it may include non-textual data such as shift sequences. Examples of CESs include UTF-8 (yes, UTF-8 is both a CEF and a CES), UTF-16BE, and UTF-16LE from Unicode, and the Asian ISO 2022 and EUC.

=item I<TES>

I<Transfer Encoding Syntax> - a reversible transform of encoded data, which may or may not contain textual data. These include various Internet or other transmission/storage transformations that either encapsulate/protect data for transmission or compress it.
Examples include base64, uuencode, quoted-printable, zip, bzip2.

=back

In addition to the above: I<CM> - I<Character Map> - is a mapping that combines an ACR, a CCS, a CEF, and a CES. Character Maps are what in the IAB architecture get the IANA charset identifiers. TES is assumed to lie beyond the Parrot core or I/O.

=head2 Parrot's Character Encoding Model

Parrot will use the Unicode encoding model layers ACR, CCS, CEF, and CES; that is, use the Unicode standard as its abstract character repertoire, coded character set, character encoding form, and character encoding scheme. B<Any code points in Parrot will be Unicode code points.> All methods of creating strings (in Perl: chr(), \xHH, \x{HH...}, \N{...}, pack()) should create Unicode, and all methods of converting strings to code points (in Perl: ord() and unpack()) should generate Unicode code points.

=head2 Of Bits and Bytes

There is no string type built out of native eight-bit bytes. One may need such a data type, though, for representing bit strings (in Perl: vec() and the associated bitstring operators), and maybe also for completely opaque binary objects ("no user serviceable parts inside"), though especially for the latter some other data type than a "string" may be preferable. Because a bitstring has no ACR, CCS, or CES, it is just a CEF without any character coding. Of course, traditionally a bit pattern like 01000001 has been associated with, say, "A", but that is to be considered purely an accident of the implementation. It is suggested that for Parrot the bitstring data type be completely dissociated from the string type, and trying to mix the two should be considered an error (except maybe for data-type crossing interfaces, like Perl's pack() and unpack()). Bitstrings have a rich array of their own operations, and it is also suggested that the "libc" equivalent of Parrot contain efficient implementations of these bitstring operations.

=head1 IMPLEMENTATION

=head2 Internal Storage

The internal storage of text is based on three layers of types. The basic type is

    typedef enum {
        UNICODE_CEF_UTF_8,
        UNICODE_CEF_UTF_16,
        UNICODE_CEF_UTF_32
    } PARROT_CEF;

    typedef struct {
        PARROT_CEF CEF;
        UINTVAL    size;
        char       data[1];
    } PARROT_STRING;

The C<CEF> defines the character encoding form being used. The C<size> tells how many bytes are contained in the C<data>. (The C<data[1]> hack is used to allow for a single-pass allocation of PARROT_STRINGs.)

B<The preferred CEF for Parrot is UTF-16.> Though UTF-8 is more space-saving at the low end of the code point space (up to U+07FF it takes at most as much space as UTF-16, and for ASCII only half), and is used in many implementations, like Perl 5, it introduces considerable code complexity if used as the primary internal storage format. Fast low-level transforms between all the CEFs are required.

Note, however, that UTF-16 is still a variable-width encoding, and relatively complex at that, thanks to the "surrogate discontinuity" at U+D800..U+DFFF (which enables UTF-16 to reach the U+10000..U+10FFFF code point space required by Unicode 3.1 and later).

For implementation efficiency, it may be useful to split UTF-16 into two subclasses

    UNICODE_CEF_UCS_2
    UNICODE_CEF_UTF_16_SURROGATE

where the UCS-2 would be 2-byte "simple" UTF-16 without surrogates and the UTF-16 surrogate would potentially contain "complex" 4-byte surrogate pairs encoding code points beyond U+FFFF. This, of course, is premature optimisation, and let's not enter that particular evil before performance testing strongly suggests we have to.
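To make the surrogate mechanism concrete, here is a minimal sketch (purely illustrative, not part of the proposed API) of how a single code point maps onto UTF-16 code units; the use of C<unsigned short> for a 16-bit code unit is an assumption of the sketch, not a requirement of this document.

    /* Sketch only: encode one Unicode code point as UTF-16 code units.
     * Returns the number of 16-bit code units written (1 or 2), or 0
     * for an invalid code point (a bare surrogate or a value above
     * U+10FFFF). */
    static int
    utf16_encode(UINTVAL cp, unsigned short unit[2])
    {
        if (cp <= 0xFFFF) {
            if (cp >= 0xD800 && cp <= 0xDFFF)
                return 0;                  /* bare surrogates are illegal */
            unit[0] = (unsigned short)cp;
            return 1;
        }
        if (cp > 0x10FFFF)
            return 0;
        cp -= 0x10000;                     /* 20 bits remain */
        unit[0] = (unsigned short)(0xD800 | (cp >> 10));    /* high surrogate */
        unit[1] = (unsigned short)(0xDC00 | (cp & 0x3FF));  /* low surrogate  */
        return 2;
    }

For example, U+10000 becomes the pair 0xD800 0xDC00, and U+10FFFF becomes 0xDBFF 0xDFFF; decoding simply reverses the arithmetic.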
One possibility would be to use UTF-32, which would be fixed-width and without code range complexities, but the wasted space might be too much to bear in normal use. (However, it would be interesting to have all the internal string functions implemented also in UTF-32, so that one could first do space/time benchmarking, and then possibly use UTF-32 anyway if that turns out to be considerably faster, and one has the memory to spare.)

The PARROT_STRING as such doesn't know how many characters there are in the string contained in the C<data>. For that the following type is used to wrap the PARROT_STRING:

    typedef struct {
        INTVAL        size;
        INTVAL        index;
        INTVAL        index_offset;
        INTVAL        last_offset;
        INTVAL        size_valid:1;
        INTVAL        index_valid:1;
        INTVAL        last_valid:1;
        PARROT_STRING string;
    } PARROT_SIZED_STRING;

Whenever some computation determines the number of characters in C<data>, it must update the C<size> field and set the C<size_valid> boolean. Whenever some computation changes the C<data> in a way that changes the number of characters, the computation must also clear the C<size_valid> boolean.

Whenever some computation determines the first byte of the last character in C<data>, it must update the C<last_offset> to point to that first byte, and set the C<last_valid> boolean. Whenever some computation invalidates the C<last_offset>, the computation must also clear the C<last_valid> boolean.

Whenever some computation determines the first byte of the C<index>-th character in C<data>, it must update the C<index_offset> to point to that first byte, and set the C<index_valid> boolean. Whenever some computation invalidates the C<index_offset> (changes the length of a subsequence of C<data> before the C<index_offset>), the computation must also clear the C<index_valid> boolean. This is intended to serve as a very light-weight cache of the character offset (a sketch of how such a cache might be consulted is given below).

The C<index_offset> and C<last_offset> are offsets by design, as opposed to pointers. This way one cannot accidentally mix pointers from different allocations; all operations are to be relative to the C<data>.

The third and final layer of the Parrot string is the localised string. Many string operations are dependent on a I<locale>, a collection of language/country/etc. dependent behaviours. If the C<locale_vtable> is NULL, the default (locale-independent) behaviours are used. How locale names are mapped into customised vtable operations is outside the scope of Parrot. However, B<it is strongly suggested that any locale implementations recognise the basic ll_CC syntax>, where I<ll> is an ISO 639 language code, and I<CC> is an ISO 3166 country code. But what the actual set of behaviours behind, say, C<de_DE>, C<fr_FR>, or C<fr_CA> is, has unfortunately not been standardised. For the locale data (and therefore behaviours) the use of something widely available such as the IBM ICU is suggested; see http://oss.software.ibm.com/icu/ . Note that the legacy 8-bit locale implementation supplied by many systems is basically unusable with Unicode, because there are no portable ways to map between the 8-bit locales and Unicode, and because the 8-bit legacy locale implementations are not portable even between themselves. All the solutions one can possibly come up with are vendor-dependent.

The contents of the C<locale_vtable> are discussed under L</Operations>. The C<normalized_string> and C<collated_string> refer to the Unicode-normalized version of the string and to the Unicode collation I<sort key>, and they are discussed under L</Normalization> and L</Collation>.
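As a concrete illustration of the C<index>/C<index_offset> cache described above, here is a minimal sketch (not part of the specification) of an index-to-offset lookup that consults the cache before falling back to a scan from the beginning of the string. The per-CEF helper C<parrot_cef_next_offset()> is hypothetical and stands for whatever "advance one character" primitive the implementation ends up with (cf. next_character() under L</Operations>).

    /* Sketch only: return the byte offset of the wanted-th character,
     * using the PARROT_SIZED_STRING offset cache when it is valid. */
    INTVAL
    parrot_index_to_offset(PARROT_SIZED_STRING *s, INTVAL wanted)
    {
        INTVAL index  = 0;
        INTVAL offset = 0;

        /* Start from the cached position if it does not overshoot. */
        if (s->index_valid && s->index <= wanted) {
            index  = s->index;
            offset = s->index_offset;
        }
        while (index < wanted) {
            /* Hypothetical per-CEF primitive: offset of the next character. */
            offset = parrot_cef_next_offset(&s->string, offset);
            if (offset < 0)
                return offset;    /* ran off the end, cf. AFTER_STRING_END below */
            index++;
        }
        /* Refresh the cache for the next lookup. */
        s->index        = wanted;
        s->index_offset = offset;
        s->index_valid  = 1;
        return offset;
    }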
The localised string wrapper itself is:

    typedef struct {
        PARROT_STRING_LOCALE_VTABLE* locale_vtable;
        PARROT_NORMALIZED_STRING*    normalized_string;
        PARROT_COLLATED_STRING*      collated_string;
        PARROT_SIZED_STRING          sized_string;
    } PARROT_LOCALIZED_STRING;

    typedef PARROT_LOCALIZED_STRING parrot_string;

The C<sized_string> in the above is known as the I<primary string>, as opposed to the secondary strings, the normalized string and the collated string, which are transformed versions of the primary string needed only for certain operations. All string updates should modify the primary string and invalidate the secondary strings.

Note that there are further candidates for transformed versions of the primary string, namely the case-changed versions of the string: upper, lower, title, and fold. Those would work fine for the casing functions (in Perl: uc(), lc(), ucfirst(), lcfirst()). However, for use in case-ignoring pattern matching, simply caching the case-changed string would not be enough: since in Unicode the casing operations can transform one character into many, a way would be needed to map byte offsets pointing to the first bytes of characters in either the primary or the case-changed version to byte offsets in the other string. Additional twists are that the "fold" casing is locale-dependent (certain characters behave differently under Turkish and Azeri), and in one particular case position-dependent (how the Greek sigma character behaves depends on its surrounding characters).

=head2 Operations

The operations are assumed to be vtable operations, so the parent "object" (the string being operated on) is never shown explicitly. Because all three CEFs (UTF-8, UTF-16, UTF-32) are valid as the low-level encoding of strings, most, if not all, of these operations need to be coded in triplicate, one for each CEF (and in quadruplicate, if the implementation of UTF-16 is split into UCS-2 and "true" UTF-16).

Some error codes:

    typedef enum {
        UNICODE_ILLEGAL_CODE_POINT = -1,
        BEFORE_STRING_BEGINNING    = -2,
        AFTER_STRING_END           = -3,
        NO_MATCH                   = -4
    } PARROT_STRING_RETURN_CODE;

B<NOTE THAT GREAT CARE MUST BE EXERCISED TO AUDIT THE UNICODE IMPLEMENTATION FOR SECURITY:> against accepting illegal input, against buffer overflows, and against generating illegal byte sequences or code points. Do pass in the maximum extent of a byte sequence: just passing in the pointer to the sequence is bad. Do preserve enough space for the result: for example the case transformations may require up to I<three> times more space than the original. Do not expect to be able to pass in a code point and return a code point: many Unicode operations are 1:N or N:1. Do not under any circumstances generate illegal UTF-8 (such as surrogates, or 6-byte sequences for Plane 1 characters; you should be generating 4-byte sequences for those); warn about receiving illegal UTF-8: illegal UTF-8 byte sequences are indicative of and conducive to buffer overflows. See http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html and http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt .

=over 4

=item code_point

    INTVAL code_point(INTVAL offset);

Given a byte offset to the first byte of a character, return the code point of the character at that offset (a CEF -> CCS conversion). If the byte sequence contains an illegal code point (such as a surrogate in UTF-8), UNICODE_ILLEGAL_CODE_POINT will be returned.

=item next_character

    INTVAL next_character(INTVAL offset);

Given a byte offset to the first byte of a character, return the offset to the first byte of the next character.
If at the first byte of the last character, AFTER_STRING_END will be returned.

=item previous_character

    INTVAL previous_character(INTVAL offset);

Given a byte offset to the first byte of a character, return the offset to the first byte of the previous character. If at the first byte of the first character, BEFORE_STRING_BEGINNING will be returned.

=item skip_characters

    INTVAL skip_characters(INTVAL offset, INTVAL skip);

Given a byte offset to the first byte of a character, return the offset to the first byte of the character C<skip> characters before or after that character. If the skip would go beyond the beginning or the end of the string, BEFORE_STRING_BEGINNING or AFTER_STRING_END will be returned.

=item distance_in_characters

    INTVAL distance_in_characters(INTVAL offset_a, INTVAL offset_b);

Given two byte offsets pointing to the first bytes of two characters, return the distance in characters between the two characters. If either of the offsets points outside the string, BEFORE_STRING_BEGINNING or AFTER_STRING_END will be returned.

=item index_of_offset

    INTVAL index_of_offset(INTVAL offset);

Given a byte offset to the first byte of a character, return the index of the character, in characters, from the beginning of the string. If the character would be beyond the end of the string, AFTER_STRING_END will be returned.

=item offset_of_index

    INTVAL offset_of_index(INTVAL index);

Given a character index, return the byte offset to the C<index>-th character. If the character would be beyond the end of the string, AFTER_STRING_END will be returned. Index wrapping from the end of the string (so that for example -2 would be the second to last character) is suggested to be delegated to higher level APIs.

=item get_substring

    INTVAL get_substring(INTVAL index, INTVAL size, parrot_string *dststring);

Given a character index, a character count (size), and a destination string, set in the destination string the data to point to the data of the parent string, the offset to point to the right offset for the C<index>-th character, and the size of the destination string to C<size>. To really make a copy, you need to use copy_string() on the destination string. If the index would be beyond the end of the string, 0 (zero) will be returned. Otherwise, 1 (one) will be returned. Index wrapping from the end of the string (such as -2 meaning the second to last character), or size truncation at the end of the string, is suggested to be delegated to higher level APIs.

=item set_substring

    INTVAL set_substring(INTVAL index, INTVAL size, parrot_string *srcstring);

Given a character index, a character count (size), and a source string, set in the parent string the subsequence of C<size> characters starting at the C<index>-th character to point to the data of the source string (releasing any possible old destination data), and the size of the parent string to the updated length. If the index or the end of the updated data would be beyond the end of the string, 0 (zero) will be returned. Otherwise, 1 (one) will be returned. Index wrapping from the end of the string (such as -2 meaning the second to last character), or size truncation at the end of the string, or string extension (probably using insert_characters()), is suggested to be delegated to higher level APIs.

=item copy_string

    void* copy_string();

Creates a new copy of the data of the string and sets the data of the string to point to that new copy.
B<Does not release the old data; that is left for higher level APIs.> Returns the pointer to the new data, or the NULL pointer in case of a failure.

=item insert_characters

    INTVAL insert_characters(INTVAL index, INTVAL character, INTVAL size);

Insert C<size> copies of the character C<character> into the string at the character index C<index>. If the index or the end of the updated data would be beyond the end of the string, 0 (zero) will be returned. Otherwise, 1 (one) will be returned. Index wrapping from the end of the string (such as -2 meaning the second to last character), or size truncation at the end of the string, is suggested to be delegated to higher level APIs. This is one of the only two APIs that modify their parent string; the other one is delete_characters().

=item delete_characters

    INTVAL delete_characters(INTVAL index, INTVAL size);

Delete C<size> characters from the string at the character index C<index>. If the index or the end of the updated data would be beyond the end of the string, 0 (zero) will be returned. Otherwise, 1 (one) will be returned. Index wrapping from the end of the string (such as -2 meaning the second to last character), or size truncation at the end of the string, is suggested to be delegated to higher level APIs. This is one of the only two APIs that modify their parent string; the other one is insert_characters().

=item replace_characters

    INTVAL replace_characters(INTVAL index, INTVAL character, INTVAL size);

Replace C<size> characters in the string at the character index C<index> with C<size> copies of the character C<character>. If the index or the end of the updated data would be beyond the end of the string, 0 (zero) will be returned. Otherwise, 1 (one) will be returned. Index wrapping from the end of the string (such as -2 meaning the second to last character), or size truncation at the end of the string, or string extension (probably using insert_characters()), is suggested to be delegated to higher level APIs.

=item find_substring

    INTVAL find_substring(parrot_string *substring, INTVAL index, INTVAL direction);

Given a character index to a string, find the abs(direction)-th occurrence of the substring after (if direction > 0) or before (if direction < 0) the C<index>-th character. If a match is found, the offset to the match is returned, NO_MATCH otherwise. If the C<direction> is zero, the substring is compared against the parent string at the C<index>-th character. If they match, the offset to the match is returned, NO_MATCH otherwise.

=item to_uppercase

    INTVAL to_uppercase(INTVAL code_point, parrot_string *upperstring);

Given a code point, the uppercase string (if not NULL) is updated to contain the Unicode uppercase version of the code point as described in the Unicode Technical Report #21, I<Case Mappings>, http://www.unicode.org/unicode/reports/tr21/ . If there is an uppercase mapping for the code point, 1 (one) is returned, 0 (zero) otherwise. Note that a simple one code point to another code point mapping would not be enough, since case mapping is much more complex than that. The same applies to to_lowercase(), to_titlecase(), and to_foldcase().

=item to_lowercase

    INTVAL to_lowercase(INTVAL code_point, parrot_string *lowerstring);

Given a code point, the lowercase string is updated to contain the Unicode lowercase version of the code point as described in the Unicode TR #21. If there is a lowercase mapping for the code point, 1 (one) is returned, 0 (zero) otherwise.
=item to_titlecase

    INTVAL to_titlecase(INTVAL code_point, parrot_string *titlestring);

Given a code point, the titlecase string is updated to contain the Unicode titlecase version of the code point as described in the Unicode TR #21. If there is a titlecase mapping for the code point, 1 (one) is returned, 0 (zero) otherwise.

=item to_foldcase

    INTVAL to_foldcase(INTVAL code_point, parrot_string *foldstring);

Given a code point, the foldcase string is updated to contain the Unicode case folding version of the code point as described in the Unicode TR #21. If there is a case folding mapping for the code point, 1 (one) is returned, 0 (zero) otherwise. B<Note that case folding is the version of the text to use for case-ignoring matching, not for lowercasing.> (Unicode calls case-ignoring matching I<loose matching>.) (In Perl: foldcase is used for C<//i>, but lowercase for C<lc()>.) Foldcase and lowercase are very similar transformations, but they are not identical.

=item get_property

    void* get_property(INTVAL code_point, INTVAL table, INTVAL property);

Given a code point, a Unicode data table, and a property, return a pointer to the value of the property. For example:

    /* get the general category (such as "Lu" for uppercase letter) */
    get_property(code_point, UNICODE_DATA, UNICODE_GENERAL_CATEGORY);

    /* get the script (such as "Greek") */
    get_property(code_point, UNICODE_SCRIPTS, UNICODE_SCRIPT);

    /* get the lowercase */
    get_property(code_point, UNICODE_DATA, UNICODE_LOWERCASE);

    /* get the foldcase */
    get_property(code_point, UNICODE_CASEFOLDING, UNICODE_FOLDCASE);

    typedef enum {
        GET_PROPERTY_ERROR_INTERNAL = 0,
        GET_PROPERTY_ERROR_TABLE    = -1,
        GET_PROPERTY_ERROR_PROPERTY = -2
    } PARROT_STRING_GET_PROPERTY_ERROR;

=over 8

=item *

In case of an internal error, (void*)GET_PROPERTY_ERROR_INTERNAL is returned.

=item *

If there is no said table, (void*)GET_PROPERTY_ERROR_TABLE is returned.

=item *

If there is no said property, (void*)GET_PROPERTY_ERROR_PROPERTY is returned.

=item *

In all other cases (such as the name of the script of the code point, or the lowercase version of a code point), a pointer to the value is returned. Such values are read-only, so if you want to modify them you must explicitly copy them. If the value is a Unicode string, a pointer to a PARROT_SIZED_STRING is returned.

=back

The loose matching of property names (for example as described in the Unicode Technical Report #18, I<Unicode Regular Expression Guidelines>, http://www.unicode.org/unicode/reports/tr18/ ) is suggested to be delegated to higher level APIs.

=back

=head2 Normalization

Normalization is a process where equivalent forms of characters are first decomposed and then composed so that they have equivalent binary representations (since a single character may be represented in several different ways). The normalized forms are useful for string equivalence comparisons, but one probably cannot automatically convert everything to a canonical normalized form, because then the I/O equivalence (you read in data and write out data: are the data identical?) could easily be broken.

    typedef enum {
        UNICODE_NFD,
        UNICODE_NFC,
        UNICODE_NFKD,
        UNICODE_NFKC
    } UNICODE_NORMALIZATION;

    typedef struct {
        UNICODE_NORMALIZATION normalization_form;
        PARROT_SIZED_STRING   sized_string;
    } PARROT_NORMALIZED_STRING;

B<The preferred normalization form of Parrot is NFKC.> NFKC stands for Normalization Form KC: Compatibility Decomposition, followed by Canonical Composition. Fast low-level transforms between the normalization forms are required.
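The C<normalized_string> of a C<parrot_string> is meant to be computed lazily and cached, as described below. A minimal sketch (not part of the specification) of that on-demand caching follows; the functions C<parrot_unicode_normalize()> and C<parrot_free_normalized()> are hypothetical and stand for whatever normalization transform (for example a wrapper around ICU) and storage release routine are eventually used.

    /* Sketch only: return the cached normalized form, computing it on
     * first use; any operation that modifies the primary string is
     * expected to have freed and NULLed normalized_string already. */
    PARROT_NORMALIZED_STRING *
    parrot_get_normalized(parrot_string *s, UNICODE_NORMALIZATION form)
    {
        if (s->normalized_string
            && s->normalized_string->normalization_form == form)
            return s->normalized_string;                   /* cache hit */

        /* Cache miss (or a different form): recompute from the primary. */
        if (s->normalized_string) {
            parrot_free_normalized(s->normalized_string);  /* hypothetical */
            s->normalized_string = NULL;
        }
        s->normalized_string =
            parrot_unicode_normalize(&s->sized_string, form); /* hypothetical */
        return s->normalized_string;
    }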
Note that any string operations should primarily update the primary C<sized_string> of a C<parrot_string> and invalidate the C<normalized_string> by first releasing the storage for the normalized string and then setting the C<normalized_string> pointer in the primary string to a NULL pointer. When the normalized version of the string is called for, the normalized version will be computed and cached on demand. For more information see the Unicode Technical Report #15, I<Unicode Normalization Forms>, http://www.unicode.org/unicode/reports/tr15/ .

=head2 Collation

For Unicode collation (sorting) the following type is used:

    typedef struct {
        INTVAL* (*collation_element)(INTVAL code_point);
        INTVAL  size;
        INTVAL* sort_key;
    } PARROT_COLLATED_STRING;

The C<collation_element> function pointer is used to return (at least) three-element weight vectors for code points, and those weights are then used to compute the C<sort_key>, which is an array of unsigned integers. The sort keys can then be binary-compared to produce the correct comparison. The three (or more) levels of weighting can be customised. For a proper implementation of Unicode collation, Unicode normalization (decomposition to canonical mappings) is required. For more information see the Unicode Technical Report #10, I<Unicode Collation Algorithm>, http://www.unicode.org/unicode/reports/tr10/ .

Similarly to normalization, any string operations should primarily update the C<sized_string> of a C<parrot_string> and invalidate the sort key by first releasing the storage for the sort key and then setting the C<collated_string> pointer in the primary string to a NULL pointer. When the collated version of the string is called for, the sort key will be computed and cached on demand.

=head1 OUTSIDE WORLD

Any character repertoires other than Unicode, and any Unicode encodings other than UTF-8, UTF-16, and UTF-32, must be converted to Parrot strings. This "must" includes even the common US-ASCII and ISO 8859-1 (Latin-1). For true 7-bit US-ASCII the task is luckily easy, because the 128 characters of US-ASCII are encoded identically in UTF-8. But for any eight-bit CES such as the ISO 8859 family, the Windows code pages, KOI8-R, or EBCDIC, a conversion is necessary.

Note that UTF-EBCDIC is a CES, not a CEF, and therefore UTF-EBCDIC should not be used for internal storage, only for external storage and I/O. The main reason for UTF-EBCDIC is the easy interchange of Unicode data between EBCDIC systems; for more information see the Unicode Technical Report #16, I<UTF-EBCDIC>, http://www.unicode.org/unicode/reports/tr16/ .

The method of conversion to and from Parrot strings is expected to be outside the Parrot core, in the source code parsing and in the I/O subsystem. One notable thing is that if using UTF-16 or UTF-32 the I/O subsystem should automatically either recognise (if reading) or generate (if writing) the appropriate BOMs (byte order marks) and then write the stream appropriately either in BE or LE (big endian or little endian). For UTF-8 the use of the BOM is optional but recommended. (--can't find anything for or against this on the Unicode web site, but at least having a BOM would be consistent--)

=head1 TO DO

Not much thought has been given to string searching or pattern matching interfaces. Boyer-Moore will work just fine with UTF-8, since no false matches can happen thanks to the design of UTF-8.

About the implementation of character classes: since the Unicode code point range is big, a single big bitmap won't work any more: firstly, it would be big.
Secondly, for most cases, it would be wastefully sparse. A balanced binary tree of (begin, end) points of ranges is suggested. That would seem to give the required flexibility and a reasonable compromise between speed and space for implementing the operations required by both the traditional regular expression character classes (complement, case-ignorance) and the new Unicode character class semantics (difference, intersection); see the Unicode Technical Report #18, I<Unicode Regular Expression Guidelines>, http://www.unicode.org/unicode/reports/tr18/ .

Another, possibly simpler, way would be to use inversion lists: one-dimensional arrays where even (starting from zero) indices store the beginnings of ranges belonging to the class, and odd indices store the beginnings of ranges not belonging to the class. Note "array" instead of (a linked) "list": with an array one can do a binary search to determine membership (so an inversion list is in effect a flattened binary tree); a minimal sketch of such a membership test is given at the end of this document. Yet another way would be to use various two-level table schemes. The choice of the appropriate data structure, as always, depends on the expected operational (read vs modify) mix and the expected data distribution.

Especially little thought has been given to the new scheme of having regular expression opcodes as part of the core opcodes. The next_character() and previous_character() operations are possibly of some use (note that general random access naturally won't work in variable-length data like UTF-8). Some additional notes: beware of the 1:N and N:1 effects of upper/lowercasing and casefolding: any assumptions/optimisations concerning the expected (minimum) length of matches, and even the apparent length of the target string, get more complicated. A single character in a literal or a character class may match more than one target character, and vice versa. Look-ahead and look-behind also need to consider this variability.

If IBM ICU is picked as the low-level implementation of Unicode (and as the source of locale data), this whole document needs to be sanity checked against the ICU architecture and APIs.

Not much thought has been given to the interplay of encodings and I/O, converting encodings on the fly, or the related newline conventions; those issues have been delegated to the source code parsing and I/O subsystem discussions. One notable thing about these, though, is that there should be at least two levels of "name resolution". The lower level should accept only one form of a name, such as "ISO 8859-1", but the higher level should accept looser matching, such as "iso8859_1", or "Latin 1", or "latin-1", and so forth. Some encodings, like the CJKV ones, might require extra hooks for automagic detection of the encoding.

No thought has been given to the effects of either reference counting or garbage collection.

Not much thought has been given to multithreaded operation where Parrot text might be shared between different threads. The design needs to be validated from that viewpoint. The C<index_offset> of PARROT_SIZED_STRING is the most likely victim.

=head1 FUTURE

It is important to remember that Unicode is an evolving standard: new characters get added through a formal process, errors in character properties get fixed, and so forth. Therefore it is of utmost importance to track the development of the Unicode process and keep the Parrot implementation of Unicode current.
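Finally, as promised under L</TO DO>, here is a minimal sketch (purely illustrative, not a proposed API) of the inversion-list membership test: a sorted array of range starts, where even indices (counting from zero) begin ranges that are inside the class and odd indices begin ranges that are outside it, searched with a binary search.

    /* Sketch only: does the character class described by the inversion
     * list inv[0..n-1] (sorted; even indices start "in" ranges, odd
     * indices start "out" ranges) contain the code point cp? */
    static int
    inversion_list_contains(const UINTVAL *inv, INTVAL n, UINTVAL cp)
    {
        INTVAL lo = 0, hi = n;

        if (n == 0 || cp < inv[0])
            return 0;                 /* below the first range: not in the class */
        while (lo + 1 < hi) {         /* binary search: largest start <= cp */
            INTVAL mid = (lo + hi) / 2;
            if (inv[mid] <= cp)
                lo = mid;
            else
                hi = mid;
        }
        return (lo % 2) == 0;         /* even index: inside an "in" range */
    }

For example, the class [a-z] would be the two-element list { 0x61, 0x7B }: code points from U+0061 up to (but not including) U+007B are in the class, and everything from U+007B onwards is not.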