Do please note that whatever objection you may have to this has at least three people who disagree differently, and one or more (who aren't me) who agree with what you disagree with. Also note that I'm not entirely happy with this either.
Consider it an exercise in group coping--we will all deal with it and make do. All complaints, *including* mine, shall be summarily binned, with extreme prejudice. And yes, this means I won't be complaining about Unicode any more.
++++++++++++++++++++++Cut Here++++++++++++++++++++++++
Strings, the final design document
Requirements
============
* Efficiency - The system must do the absolute minimum amount of work to get the job done
* Correctness - The job that's done must actually be right
* Upgradeability - This stuff's all going to change again in five years so we really don't want to have to do it over again.
* Flexibility - Since, unfortunately, no one way of looking at strings is going to be right for everyone
Realities
=========
* There are a lot of different ways of representing text. Many of them annoying, some of them wildly incompatible, none of them wrong.
* We don't get to make the call on what is right or wrong
* Some of the languages we support don't do Unicode, or do Unicode and other things (including perl 5 and Ruby)
Desires
=======
* We want to make it easily possible to do the right thing with string data
* We want all the troublesome stuff to be as invisible as possible
* We want to make it look like everyone's got what they want without actually doing it when we don't have to
With that list in mind, here's parrot's solution. Please note that the *only* thing up for discussion is a more correct label for 'grapheme'. It is, otherwise, the final external design.
Definitions
===========
BYTE - 8 bits 'o data
CODE POINT - A 32-bit integer that represents a single thing in a character set
ENCODING - How code points are mapped to bytes, and vice versa
CHARACTER SET - Contains meta-information about code points. This includes both the meaning of individual code points (65 is capital A, 776 is a combining diaeresis) as well as a set of categorizations of code points (alpha, numeric, whitespace, punctuation, and so on), and a sorting order.
GRAPHEME - One or more code points which make up a single real entity. The "oe" (I'm stuck with ASCII here, that should really be an o with two dots over it) in Leo's last name is, in the Unicode character set, a single character with two code points, 111 (lowercase o) and 776 (combining diaeresis). Graphemes can *not* be legitimately decomposed into individual code points in most cases.
Important note
==============
This document is completely language-insensitive--that is, there's no language attached to any particular piece of data. Collation and casing rules are done based on a single global setting that is unconditionally applied in all cases. Setting and querying those rules is beyond the scope of this document.
Conceptually
============
The smallest unit of text that Parrot will process is the string, something that can be put in an S register. These strings have the following properties:
*) They have an encoding
*) They have a character set
*) They have a taint status
The above things are independent of the view of the string presented to bytecode programs--these are metadata elements that describe the contents of the string as they actually exist, rather than as they are presented.
Internally parrot is capable of maintaining strings in several different basic encodings (8-bit, 16-bit, and 32-bit integer, as well as UTF-8) and may load other encodings on the fly as needed. Parrot is also capable of maintaining strings in many different character sets (ASCII, EBCDIC, Unicode, Latin-n, etc) which are also dynamically loadable. Finally Parrot is capable of maintaining strings in many different languages, which also may be loaded on the fly.
This is done for maximum efficiency, regardless of the view of the data presented to the bytecode programs. Conversion to a different format may be done if needed to properly express the semantics of the program, but will not be done if not needed.
For example, consider the following:
    use Unicode;
    open FOO, "foo.txt", :charset(latin-3);
    open BAR, "bar.txt", :charset(big5);
    $filehandle = 0;
    while (<>) {
        if ($filehandle++) {
            print FOO $_;
        } else {
            print BAR $_;
        }
        $filehandle %= 2;
    }
Relatively simple: the program reads from the input filehandle and splits the data, line by line, between two output files. The two output files have different requirements -- FOO gets data in Latin-3, while BAR gets it in Big5. The "use Unicode;" thing at the top's a hand-wavey way of asserting that we want full Unicode text semantics.
Even so, there's no actual reason in this program to convert to Unicode at all. If the input file is either Latin-3 or Big5, half of the lines read don't have to be converted to anything. If the input file's a proper subset of both (like, US ASCII) then none of the lines read in need any conversion at all.
If Parrot forced all input data to be converted to Unicode internally then this program would potentially have some significant overhead, depending on the type of the input file. Given the output, the input is likely either Latin-3 or Big5, either of which needs some conversion to get turned into Unicode, while Unicode is guaranteed to need some conversion for proper output to both files.
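The "only convert when you have to" idea above can be sketched roughly like this. It's Python rather than anything Parrot actually runs, and `TaggedLine` and `write_as` are made-up names for illustration; Python's codec names ("iso8859_3", "big5") stand in for Parrot's character sets.

```python
from dataclasses import dataclass


@dataclass
class TaggedLine:
    """Raw bytes plus the charset they are known to be in (hypothetical)."""
    data: bytes
    charset: str


def write_as(line: TaggedLine, target: str) -> tuple[bytes, bool]:
    """Return (bytes to write, whether a conversion was actually needed)."""
    if line.charset == target:
        # Same charset in and out: hand the bytes through untouched.
        return line.data, False
    if line.charset == "ascii" and target in ("iso8859_3", "big5"):
        # US ASCII is a subset of both targets, so the bytes are already valid.
        return line.data, False
    # Only now pay for a conversion, pivoting through Unicode.
    return line.data.decode(line.charset).encode(target), True


line = TaggedLine(b"plain text\n", "big5")
print(write_as(line, "big5"))       # passed through, no conversion
print(write_as(line, "iso8859_3"))  # converted via Unicode
```

With a forced-Unicode internal form, every line would take the last branch twice: once on the way in and once on the way out.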
Synthesized code points
=======================
Parrot provides code points for all graphemes, even for those character sets/encodings which don't inherently do so. Most sets that have variable-length encodings use an escape sequence scheme--the value of the first byte in a character determines whether the grapheme is a one or more byte sequence. When parrot turns these into code points it does it by building up the final value. The first byte is put in the low 8 bits of the integer. If there's a second byte in the sequence the current value is shifted left 8 bits and the new byte is stuffed in the low 8 bits. If there's a third byte in the sequence everything is shifted left again 8 bits and that third byte is stuffed in the bottom, and so on.
For example, in Shift-JIS, if the first byte is in the range 0x21-0x7E or 0xA1-0xDF the grapheme is a single byte. If the first byte is in the range 0x81-0x9F or 0xE0-0xEF the grapheme takes two bytes, with the first byte determining which table the second byte indexes into. The roman grapheme A is represented by a single byte 0x41, while the Japanese hiragana KA is represented by the byte sequence 0x82 0xA9. When parrot turns this into code points, it becomes two integers, 0x00000041 and 0x000082A9. (Though it could represent them as 16-bit integers, since no character takes three or more bytes)
While this is somewhat unconventional, it makes the text easy to process internally as fixed-width integers, is trivially transformed back into a byte stream, and trivially turned from a byte stream into integers in the first place. It also has the advantage of making what was a variable-width encoding (some of which make it difficult or impossible to tell, if you pick a byte at a random spot in the byte stream, whether you're in the middle of a grapheme or not) into a fixed-width encoding. As such it makes a reasonably pleasant way to manipulate this sort of text.
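The shift-and-stuff packing described above can be sketched in a few lines. This is an illustrative Python sketch, not Parrot's code: the function names are hypothetical, and the lead-byte test uses only the Shift-JIS ranges quoted in the text.

```python
def is_lead_byte(b: int) -> bool:
    """True if b starts a two-byte Shift-JIS sequence (ranges from the text)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def bytes_to_code_points(data: bytes) -> list[int]:
    """Pack each one- or two-byte sequence into a single integer."""
    points = []
    i = 0
    while i < len(data):
        value = data[i]
        if is_lead_byte(value):
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte sequence")
            # Shift what we have left 8 bits, put the next byte in the low 8.
            value = (value << 8) | data[i + 1]
            i += 1
        points.append(value)
        i += 1
    return points


def code_points_to_bytes(points: list[int]) -> bytes:
    """Reverse the packing: emit the high byte (if any), then the low byte."""
    out = bytearray()
    for value in points:
        if value > 0xFF:
            out.append(value >> 8)
        out.append(value & 0xFF)
    return bytes(out)


packed = bytes_to_code_points(b"\x41\x82\xa9")
print([hex(p) for p in packed])  # ['0x41', '0x82a9']
```

Once the data is in this form, indexing by code point is a plain array lookup, and `code_points_to_bytes` gets the original byte stream back.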
Conversion Rules
================
There are two types of conversion: from one thing (encoding, charset, or language) to a thing of a similar type, or to a thing of a different type.
Similar here means a thing where the conversion is lossless or accepted as good enough to have no semantic loss--for example converting US ASCII to most character sets, or pretty much any character set to Unicode. Different here means a thing where the conversion is *not* guaranteed lossless--for example converting from Shift-JIS to US ASCII or from Unicode to Latin-1.
Conversion lossiness is gauged either as a potential loss (where data *may* be lost) or actual loss (where data, after conversion, *has* been lost). While, for example, Big5 and Shift-JIS aren't interchangeable in general so there is potential loss, they both have US ASCII as a subset so it's possible that the conversion won't actually lose any information.
Current interpreter settings determine when an exception or warning is thrown. Some languages may deem it an error to implicitly shift to an encoding where data may be lost and throw an error any time that happens, others may defer the error until actual data loss occurs, and still others may decide that data loss is fine, since if you were worried about it in the first place you would've done something about it.
Conversions are not required nor guaranteed to be symmetric. Just about everything can shift to Unicode, and US ASCII can shift to just about anything, but the converse is not true.
Since maintaining a full set of conversions is untenable, Parrot declares that, by definition, all sets can pivot through Unicode. Unicode pivoting is considered a potential loss of data, so if the interpreter is set to warn or throw exceptions on potential loss it will do so, even if the conversion is actually OK. (In which case someone had better note that somewhere) It's perfectly acceptable (and, in fact, encouraged) for a set to declare that it can explicitly pivot to another set, with the actual internal code first going through Unicode.
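Here's a rough sketch of how pivoting through Unicode and the potential/actual loss settings might fit together. It's Python using Python's own codecs as stand-ins; the `SIMILAR` table, the `policy` names, and the `LossyConversion` class are all assumptions made for illustration, not part of the design.

```python
class LossyConversion(Exception):
    """Hypothetical stand-in for the LOSSY_CONVERSION exception."""


# Hypothetical table of "similar" conversions: pairs declared lossless.
SIMILAR = {("ascii", "latin-1"), ("ascii", "big5"), ("ascii", "shift_jis")}


def convert(data: bytes, src: str, dst: str, policy: str = "actual") -> bytes:
    """Convert by pivoting through Unicode.

    policy is "potential" (raise whenever loss is possible),
    "actual" (raise only when data really is lost) or
    "ignore" (substitute '?' and carry on).
    """
    if src == dst:
        return data
    potential_loss = (src, dst) not in SIMILAR
    if potential_loss and policy == "potential":
        raise LossyConversion(f"{src} -> {dst} may lose data")
    text = data.decode(src)          # source set -> Unicode pivot
    try:
        return text.encode(dst)      # Unicode pivot -> target set
    except UnicodeEncodeError as err:
        if policy == "ignore":
            return text.encode(dst, errors="replace")
        raise LossyConversion(f"{src} -> {dst} lost data") from err


# Big5 -> Shift-JIS is a potential loss, but pure ASCII content survives.
print(convert(b"hello", "big5", "shift_jis", policy="actual"))
```

The same call with `policy="potential"` raises even though nothing would actually be lost, which matches the "even if the conversion is actually OK" behaviour described above.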
Internals
=========
Internally all strings are tagged with an encoding, a charset, and a taint status. This is the minimum amount of information that can be reasonably kept for a string without losing enough information to damage it if the data is passed into a subroutine which expects a string parameter rather than a full-blown PMC.
Tainting status is the simplest thing here, maintainable with a single bit in the flags word for the string. We have to maintain this so that the sequence:
    set S0, P0
    set P0, S0
doesn't lose the taint status of the data in P0, as well as so this:
    set S0, P0
    some_sub(S0)
passes in a properly tainted string to the some_sub subroutine. We're encouraging code to use values of the lowest possible type, but we don't want to be sacrificing safety for it.
Encoding needs to be attached to each string so we have some idea of how to turn the bytes in the string's buffer into actual code points. Since we defer transforming the string data until we actually need to use it, regardless of what logical structure we may think the string has, we still need to work on the actual structure it has. This also allows easier processing of data in an encoding different than whatever parrot may take as 'normal', if it ever does. Each character set will have a preferred encoding, but people are going to want to shift encodings around at times. (Especially the various utf-N encodings)
Character set is attached so we can tell what to do with the code points that come from the encoding and how to classify them. While we prefer Unicode, that doesn't mean we're actually *in* unicode yet. Also, since the possibility exists that we may at least have two different character sets (either Unicode or binary, even if we declare there are no others) it's less error-prone to unconditionally use the set information hanging off the string itself.
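To make the three tags concrete, here's a minimal sketch of a string header carrying an encoding, a charset, and a taint bit, plus the PMC round trip from the taint example. This is Python for illustration only; `StringHeader`, `StringPMC`, and the `TAINTED` bit value are hypothetical names and say nothing about Parrot's real structure layout.

```python
from dataclasses import dataclass

TAINTED = 0x01  # hypothetical bit position in the flags word


@dataclass
class StringHeader:
    """Sketch of the per-string metadata: buffer, encoding, charset, flags."""
    buffer: bytes
    encoding: str
    charset: str
    flags: int = 0

    @property
    def tainted(self) -> bool:
        return bool(self.flags & TAINTED)


class StringPMC:
    """Minimal stand-in for a PMC holding string data."""

    def __init__(self, value: StringHeader):
        self.value = value

    def get_string(self) -> StringHeader:
        # Analogue of "set S0, P0": copy every tag, including taint.
        v = self.value
        return StringHeader(v.buffer, v.encoding, v.charset, v.flags)

    def set_string(self, s: StringHeader) -> None:
        # Analogue of "set P0, S0".
        self.value = StringHeader(s.buffer, s.encoding, s.charset, s.flags)


p0 = StringPMC(StringHeader(b"user input", "utf-8", "unicode", TAINTED))
s0 = p0.get_string()
p0.set_string(s0)
print(s0.tainted, p0.value.tainted)  # True True
```

Because the flags word travels with the string rather than with the PMC, passing `s0` to a subroutine hands over the taint bit along with the data.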
Core functionality
==================
The following functions need to be performed by the core:
*) Transform encodings
*) Transform character sets
*) Get/set byte, code point, and grapheme from a string
*) Get/set substring
*) Get length in bytes, code points, and graphemes
*) Get/set encoding
*) Get/set character set
*) Flatten to and thaw from a binary string
*) Upcase, downcase, and titlecase
These are all unary operations. While binary operations are necessary for actual use, we'll deal with them after we get basic string manipulation working.
Opcodes
=======
The following ops are proposed. Note that for many of them there is a string-native version and a Unicode version--this is noted by a (u). For Unicode strings these will behave identically, while for strings that aren't in unicode they perform the operation and translate to or from unicode as necessary.
getbyte Ix, Sy, Iz
(u)getcodepoint Ix, Sy, Iz
(u)getgrapheme Sx, Sy, Iz
Get the byte, codepoint, or grapheme requested. Destination is either an integer (representing the byte or codepoint) or a string. Sy is the source string, Iz is the offset in bytes, code points, or graphemes from the beginning of the string.
(u)getstring Sw, Sx, Iy, Iz
This is substr, with the destination guaranteed to be in Unicode for the (u) case.
setbyte Sx, Iy, Iz
(u)setcodepoint Sx, Iy, Iz
(u)setgrapheme Sx, Sy, Iz
Sets the byte, code point, or grapheme at offset Z in source string X to the value in Y. Note that in the unicode case the source is taken to be a unicode code point or grapheme and translated to the type of the destination string. These opcodes may throw an exception if the resulting destination string is illegal (for example if the destination is a unicode string with illegal combining character construction, or in the byte case if the resulting buffer is un-decodable)
(u)setstring Sw, Sx, Iy, Iz
This is lvalue substr--the graphemes at offset Y, count Z (NB *graphemes*, not code points) are replaced by the string X. In the unicode case the string is taken to be unicode and translated to the type of the destination string.
encoding Ix, Sy
charset Ix, Sy
Returns the encoding or character set of Y.
encodingname Sx, Iy
charsetname Sx, Iy
Returns the name of the encoding or character set that corresponds to the internal value Y. (As returned by the encoding and charset ops)
findencoding Ix, Sy
findcharset Ix, Sy
Find the internal value for the encoding or character set named Y.
bytelength Ix, Sy
codepointlength Ix, Sy
graphemelength Ix, Sy
Return the length of Y in bytes, code points, or graphemes. Length is actual length, and as such may vary for otherwise identical strings. (This is especially true for strings that change encoding, as lengths can vary wildly between a UTF-8 and UTF-32 version of the same unicode string)
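The three kinds of length can differ quite a bit for the same text. The following Python sketch shows the idea; the function names mirror the proposed ops but are otherwise hypothetical, and the grapheme count uses a deliberately simplified rule (a base code point plus any combining marks that follow it), which is an assumption and not full Unicode segmentation.

```python
import unicodedata


def codepoint_length(text: str) -> int:
    """Python strings are sequences of code points, so len() is enough."""
    return len(text)


def grapheme_length(text: str) -> int:
    """Count code points that are not combining marks.

    Simplification: treats a base code point plus any following combining
    marks as one grapheme; real segmentation rules are more involved.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def byte_length(text: str, encoding: str) -> int:
    """Length of the encoded buffer, which depends on the encoding chosen."""
    return len(text.encode(encoding))


sample = "ao\u0308b"   # 'o' followed by U+0308 (776, combining diaeresis)
print(codepoint_length(sample))          # 4
print(grapheme_length(sample))           # 3
print(byte_length(sample, "utf-8"))      # 5
print(byte_length(sample, "utf-32-le"))  # 16
```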
transcode Sx, Iy
transset Sx, Iy
Change the string to have the specified encoding or character set. This is done in place.
transcode Sx, Sy, Iz
transset Sx, Sy, Iz
Generate a new version of Y with the encoding or character set Z.
tounicode Sx
tounicode Sx, Sy
Change the string to unicode. The one arg version does it in place, the two arg version generates a new string.
upcase Sx
upcase Sx, Sy
downcase Sx
downcase Sx, Sy
titlecase Sx
titlecase Sx, Sy
Make the string all uppercase, all lower case, or titlecase the first grapheme. The two-arg versions generate a new string, the one arg version does it in place.
decompose Sx, Sy
Take the string in Y and return a version in X which is a flat byte string with no language, character set, or encoding. (or, rather, the charset none, and encoding 8-bit binary)
compose Sw, Ix, Iy
Take the flattened binary string W and mark it as having the encoding X and character set Y. This may throw an exception if the string doesn't meet the requirements of the charset or encoding.
compose Sv, Sw, Ix, Iy
As above, only a new string is generated and the original left alone.
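A rough sketch of the flatten/thaw pair, in the new-string flavour. This is illustrative Python: `TaggedString` and `CompositionError` are hypothetical names, and validation is done simply by checking that Python's codec can decode the bytes, which only approximates "meets the requirements of the charset or encoding".

```python
from dataclasses import dataclass


class CompositionError(Exception):
    """Hypothetical exception for bytes that do not fit the claimed encoding."""


@dataclass(frozen=True)
class TaggedString:
    buffer: bytes
    encoding: str   # "binary" here plays the role of "8-bit binary"
    charset: str    # "none" plays the role of "charset none"


def decompose(s: TaggedString) -> TaggedString:
    """Return a flat byte string with the tags stripped."""
    return TaggedString(s.buffer, "binary", "none")


def compose(flat: TaggedString, encoding: str, charset: str) -> TaggedString:
    """Re-tag a flat byte string, checking that the bytes actually decode."""
    try:
        flat.buffer.decode(encoding)  # validation only; result is discarded
    except UnicodeDecodeError as err:
        raise CompositionError(f"buffer is not valid {encoding}") from err
    return TaggedString(flat.buffer, encoding, charset)


original = TaggedString("\u00f6".encode("utf-8"), "utf-8", "unicode")
flat = decompose(original)
print(flat.encoding, flat.charset)                     # binary none
print(compose(flat, "utf-8", "unicode") == original)   # True
```

Note that the buffer itself never changes in either direction; only the tags hanging off it do.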
Exceptions
==========
Here's a list of the exceptions that will be thrown if the string subsystem comes across things it's not happy about. All of these exceptions are optional, and may be overridden by interpreter settings. Additionally, some conversions are deemed less dangerous than others, and as such there are two different types of conversion (similar and dissimilar) rather than just one. These exceptions may also be thrown either because of potential problems (where something might happen) or actual problems (where something did happen).
* CHARSET_MISMATCH - thrown whenever a binary operation is done on strings of different character sets.
* LOSSY_CONVERSION - Thrown whenever a conversion would lose information. This includes getting a plain string from a PMC which has segmented string data in it. (This would be a PMC which has some data in Unicode, EBCDIC, and RAD-50, for example, or whose contents had different languages attached to different parts of the string data)
* DECOMPOSITION_ERROR - Thrown whenever you try and act on part of a multi-code point grapheme. This includes doing an ord() on a string where the grapheme you're ord'ing is made up of two or more code points.
-- Dan
--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk