'Kay, here's the string background info I promised. If anything is missing or unclear, let me know and I'll keep fixing it up until it isn't.
==================Cut here with a very sharp knife===============

=head1 TITLE

A Parrot string backgrounder

=head1 Overview

Strings, in Parrot, are compartmentalized, the same way so much else in Parrot is compartmentalized. There's no single 'blessed' string encoding--the closest we come is Unicode, and only as an encoding of last resort. (Unicode's not a good interchange format, as it loses information.)

=head2 From the Outside

On the outside, the interpreter considers strings to be a sort of black box. The only bits of the interpreter that much care about the string data are the regex engine parts, and those only operate on fixed-sized data.

The interpreter can only peek inside a string if that string is of fixed length, and the interpreter doesn't actually care about the character set the data is in. All character sets must provide a way to transcode to Unicode, and all character encodings must provide a way to turn their characters into fixed-sized entities. (The size may be 8, 16, or 32 bits, as the character set requires.)

Character sets may provide a way to transcode to non-Unicode sets, for example from EBCDIC to ASCII, but this is optional. If none is provided, transcoding from one set to another will use Unicode as an intermediate form, complete with potential data loss.

All character sets must provide the character lists the regular expression engine needs for the base character classes (space, word, and digit characters). This permits the regular expression code to operate on the contents of a string without needing to know its actual character set.
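The fallback rule above (use a direct transcoder if the source set offers one, otherwise go through Unicode) can be sketched in a few lines of C. Everything named here is hypothetical and made up for illustration: the C<charset> struct, the C<transcode> function, and the idea that each set also carries a from-Unicode routine (the text only requires the to-Unicode direction, but the pivot needs both). ASCII and Latin-1 are used because their code points coincide with the first 128 and 256 Unicode code points, which keeps the tables trivial.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts one fixed-size character value to another. */
typedef uint32_t (*convert_fn)(uint32_t);

/* Hypothetical per-character-set table; not Parrot's actual layout. */
typedef struct charset {
    const char *name;
    convert_fn to_unicode;                /* required of every set */
    convert_fn from_unicode;              /* assumed here so the pivot can finish */
    const struct charset *direct_target;  /* optional: one set we convert to directly */
    convert_fn direct;                    /* optional: the direct converter, or NULL */
} charset;

uint32_t identity(uint32_t c) { return c; }

/* Anything outside the target set is replaced with '?': this is the data loss. */
uint32_t unicode_to_ascii(uint32_t cp)  { return cp < 0x80  ? cp : '?'; }
uint32_t unicode_to_latin1(uint32_t cp) { return cp < 0x100 ? cp : '?'; }

/* Latin-1 offers no direct route; ASCII offers one to Latin-1. */
const charset latin1 = { "Latin-1", identity, unicode_to_latin1, NULL, NULL };
const charset ascii  = { "ASCII",   identity, unicode_to_ascii,  &latin1, identity };

/* Transcode one character, preferring a direct route over the Unicode pivot. */
uint32_t transcode(const charset *from, const charset *to, uint32_t c)
{
    assert(from != NULL && to != NULL);
    if (from->direct != NULL && from->direct_target == to)
        return from->direct(c);                    /* optional direct path */
    return to->from_unicode(from->to_unicode(c));  /* pivot through Unicode */
}
```

With these tables, C<transcode(&latin1, &ascii, 0xE9)> has no direct route, goes through the pivot, and comes back as C<'?'>, while C<transcode(&ascii, &latin1, 'A')> takes the direct route and returns C<'A'> unchanged.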
=head2 From the Inside

=head2 Technical details

The base string structure looks like:

  struct parrot_string {
    void *bufstart;
    INTVAL buflen;
    INTVAL bufused;
    INTVAL flags;
    INTVAL strlen;
    STRING_VTABLE* encoding;
    INTVAL type;
    INTVAL language;
  };

=head2 Fields

=over 4

=item bufstart

Where the string buffer starts.

=item buflen

How big the buffer is.

=item bufused

How much of the buffer is used.

=item flags

A variety of flags. The low 16 bits are reserved to Parrot; the rest are free for the string encoding library to use.

=item strlen

How long the string is in code points. (Note that, for encodings that are more than 8 bits per code point, or of variable length, this will B<not> be the same as the buffer used.)

=item encoding

Pointer to the library that handles the string encoding. Encoding is basically how the stream of bytes pointed to by C<bufstart> can be turned into a stream of 32-bit code points. Examples include UTF-8, Big 5, or Shift JIS. Unicode, ASCII, or EBCDIC are B<not> encodings.

=item type

The character set or type of data encoded in the buffer. This includes things like ASCII, EBCDIC, Unicode, Chinese Traditional, Chinese Simplified, or Shift JIS. (And yes, I know the latter's a combination of type and encoding. I'll update the doc as soon as I can reasonably separate the two.)

=item language

The language the string is in. This is essential for proper sorting, if a sort function wants to be language-aware. Just an encoding/type is insufficient for proper sorting--for example, knowing a string is UTF-32/Unicode doesn't tell you how the data should be ordered. This is especially important for those languages that overlap in the Unicode code space. Japanese and Chinese, for example, share many of the Unicode code points but sort those code points differently.

=back

Libraries for processing character sets and encodings are shareable libraries, and may be loaded on demand. They are looked up and referenced by name.
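As a rough sketch of what "looked up and referenced by name" could mean in practice, here is a tiny in-process table keyed by library name. All of it is hypothetical: C<lib_number>, C<lib_reset>, the fixed-size table, and the decision to "load" a library simply by recording its name are illustration only, not Parrot's loader.

```c
#include <assert.h>
#include <string.h>

#define MAX_LIBS 16

/* Hypothetical table of loaded libraries; the name is the only stable key. */
static const char *lib_names[MAX_LIBS];
static int lib_count = 0;

/*
 * Return the slot number for the named library, "loading" it on first use.
 * Returns -1 if the table is full. The caller's string must outlive the
 * table, because only the pointer is stored (fine for string literals).
 */
int lib_number(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < lib_count; i++)
        if (strcmp(lib_names[i], name) == 0)
            return i;                 /* already loaded: same slot as before */
    if (lib_count == MAX_LIBS)
        return -1;
    lib_names[lib_count] = name;      /* a real loader would open the shared library here */
    return lib_count++;
}

/* Forget everything, standing in for the start of a fresh process. */
void lib_reset(void)
{
    lib_count = 0;
}
```

In this sketch the number handed back is purely a product of load order: asking for "ASCII" and then "EBCDIC" gives 0 and 1, but after C<lib_reset()> asking for "EBCDIC" first gives 0. Only the name means the same thing both times.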
Each library is given an identifying number at load time, and that number shouldn't be used outside the currently running process. (EBCDIC might be character set 3 in one run and set 7 in another.)

The native encoding and character set are I<never> considered a 'real' encoding or character set. They just specify what the default is if nothing else is specified, but when bytecode is frozen to disk the actual encoding or set name will be used instead.

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk