Okay, I haven't dug through all the fallout from the ICU checkin yet, but I can see there's an awful lot of it. I'll get to that in a bit, but...

Here's the plan. We've gone over it in the past, but I'm not sure everything's been gathered together, so it's time to do so.

Some declarations:

1) Parrot will *not* require Unicode. Period. Ever. (Well, upon release, at least.) We will strongly recommend it, however, and use it if we have it.
2) Parrot *will* support multiple encodings (the bytes->code points stuff), character sets (code points->meaning of a sort), and language-specific overrides of character set behaviour.
3) All string data can be dealt with as either a series of bytes, code points, or characters. (Characters are potentially multiple code points--basically combining character stuff from those standards that do so)
4) We will *not* use ICU for core functions. (string to number or number to string conversions, for example)
5) Parrot will autoconvert strings as needed. If a string can't be converted, parrot will throw an exception. This goes for language, character set, or encoding.
6) There *may* be an overriding set of rules for throwing conversion exceptions. (They may be suppressed on lossy conversions, or required for any conversions)
7) There *may* be an overriding language used for language-specific operations (case folding or sorting).


I know ICU's got all sorts of nifty features, but bluntly we're not going to use most of them.

The original split of encoding, character set, and language is one that I want to keep. I know we've lost a good chunk of that with the latest ICU patch, but that's only temporary and the breakage is worth it to get Unicode actually in use. I expect I need to step up to the plate and get an alternate encoding and charset in, so I'll probably take a shot at JIS X 0208:1997 or CNS11643-1992. (Or whatever the current version of those is)

As far as Parrot is concerned, a string is a series of bytes which may, via the string's encoding, be turned into a series of 32-bit integer code points. Those 32-bit integer code points can be turned, via the string's character set, into a series of characters, where each character is one or more code points. Those characters may be classified and transformed based on the language of the string.
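
Put as a (purely hypothetical) C sketch--none of these are the real Parrot structures, just the shape of the layering:

    /* A minimal sketch (hypothetical names, not actual Parrot structures)
     * of a string header carrying all three layers. */
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t codepoint_t;      /* 32-bit integer code point        */

    struct encoding;                   /* bytes       <-> code points      */
    struct charset;                    /* code points <-> characters       */
    struct language;                   /* per-language overrides, optional */

    typedef struct parrot_string {
        unsigned char         *buffer;    /* raw byte data                 */
        size_t                 bufused;   /* bytes in use                  */
        size_t                 strlen;    /* length in code points         */
        const struct encoding *encoding;  /* how bytes become code points  */
        const struct charset  *charset;   /* what those code points mean   */
        const struct language *language;  /* optional language overrides   */
    } parrot_string;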

The responsibilities of the three layers are:

Encoding
========

*) Transforms a stream of bytes to and from a series of 32-bit integer code points
*) Manages the byte buffer (so buffer positioning and manipulation by code point offset are handled here)
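
Roughly, as a hypothetical function table (made-up names, just the shape of the interface described above):

    /* Hypothetical sketch of an encoding function table. The names are
     * made up; the point is the division of labour listed above. */
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t codepoint_t;
    struct parrot_string;

    typedef struct encoding {
        const char *name;                     /* "utf-8", "euc-jp", ...     */

        /* decode: raw bytes -> 32-bit code points (returns count written) */
        size_t (*to_codepoints)(const struct parrot_string *s,
                                codepoint_t *out, size_t max);

        /* encode: 32-bit code points -> raw bytes in the string buffer    */
        size_t (*from_codepoints)(struct parrot_string *s,
                                  const codepoint_t *in, size_t count);

        /* buffer management: byte offset of the Nth code point            */
        size_t (*byte_offset)(const struct parrot_string *s,
                              size_t codepoint_offset);
    } encoding;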


Character set
=============
*) Provides default manipulation and comparison behaviour (sorting and case mangling)
*) Provides default character classifications (digit, word char, space, punctuation, whatever)
*) Provides code point and character manipulation. (substring functionality, basically)
*) Provides integrity features (exceptions if a string would be invalid)
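
Again as a hypothetical sketch--a table of defaults that a language table can later override:

    /* Hypothetical sketch of a character set function table: default
     * behaviour that a language may override. Names are made up. */
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t codepoint_t;
    struct parrot_string;

    typedef struct charset {
        const char *name;                        /* "unicode", "ascii", ... */

        /* default comparison and case mangling */
        int  (*compare)(const struct parrot_string *a,
                        const struct parrot_string *b);
        void (*upcase)(struct parrot_string *s);
        void (*downcase)(struct parrot_string *s);

        /* default character classifications */
        int  (*is_digit)(codepoint_t cp);
        int  (*is_wordchar)(codepoint_t cp);
        int  (*is_space)(codepoint_t cp);
        int  (*is_punctuation)(codepoint_t cp);

        /* code point and character manipulation (substring by character,
         * where one character may be several code points) */
        struct parrot_string *(*substr)(const struct parrot_string *s,
                                        size_t char_offset, size_t char_count);

        /* integrity: non-zero if the string's data is valid for this set */
        int  (*validate)(const struct parrot_string *s);
    } charset;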


Language
========
*) Provides language-sensitive manipulation of characters (case mangling)
*) Provides language-sensitive comparisons
*) Provides language-sensitive character overrides ('ll' treated as a single character in Spanish, for example, if that's still desired)
*) Provides language-sensitive grouping overrides.
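
And the same kind of hypothetical sketch for the language layer:

    /* Hypothetical sketch of a language override table. Anything left
     * NULL falls back to the character set's default behaviour. */
    #include <stddef.h>

    struct parrot_string;

    typedef struct language {
        const char *name;                          /* "es", "en", "tr", ... */

        /* language-sensitive case mangling and comparison */
        void (*upcase)(struct parrot_string *s);
        void (*downcase)(struct parrot_string *s);
        int  (*compare)(const struct parrot_string *a,
                        const struct parrot_string *b);

        /* character/grouping override: how many code points make up the
         * character starting at this code point offset ('ll' in Spanish
         * could report 2, for instance) */
        size_t (*char_extent)(const struct parrot_string *s,
                              size_t codepoint_offset);
    } language;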


Since examples are good, here are a few. They're in an "If we"/"Then Parrot" format.

IW: Mush together (either concatenate or substr replacement) two strings of different languages but same charset
TP: Checks to see if that's allowed. If not, an exception is thrown. If so, we do the operation. If one string is manipulated in place, its language stays whatever it was. If a new string is created, either the left side wins or the default language is used, depending on the interpreter setting.
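
A sketch of that resolution rule for the new-string case (the names and the enum are made up, and the allowed/exception check has already happened by this point):

    /* Sketch only: picking the result language when a brand new string
     * is built from two strings with the same charset but different
     * languages. */
    struct language;

    enum lang_rule { LANG_LEFT_SIDE_WINS, LANG_USE_DEFAULT };

    static const struct language *
    result_language(const struct language *left,
                    const struct language *right,
                    enum lang_rule rule,
                    const struct language *interp_default)
    {
        if (left == right)
            return left;                  /* same language, nothing to do  */
        if (rule == LANG_LEFT_SIDE_WINS)
            return left;                  /* the left side wins            */
        return interp_default;            /* or the interpreter's default  */
    }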


IW: Mush together two strings of different charsets
TP: If the two strings can be losslessly converted to one of the two charsets, do so; otherwise, transform to Unicode and mush together. If the transformation is lossy, optionally throw an exception (or warning). The language rules above still apply.
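
Sketched out, with every helper name here being hypothetical stand-ins for the behaviour just described:

    /* Sketch of the charset decision when mushing strings of different
     * charsets: try a lossless conversion into one of the two existing
     * charsets first, otherwise go through Unicode. */
    struct parrot_string;
    struct charset;

    extern const struct charset *string_charset(const struct parrot_string *s);
    extern int  can_convert_losslessly(const struct parrot_string *s,
                                       const struct charset *to);
    extern void convert_charset(struct parrot_string *s,
                                const struct charset *to);  /* may warn/throw */
    extern const struct charset *unicode_charset(void);

    static void
    negotiate_charsets(struct parrot_string *a, struct parrot_string *b)
    {
        const struct charset *cs_a = string_charset(a);
        const struct charset *cs_b = string_charset(b);

        if (can_convert_losslessly(b, cs_a))
            convert_charset(b, cs_a);             /* meet in a's charset   */
        else if (can_convert_losslessly(a, cs_b))
            convert_charset(a, cs_b);             /* meet in b's charset   */
        else {
            /* lossy either way: go through Unicode; the conversion itself
             * may throw or warn, per the overriding rules above */
            convert_charset(a, unicode_charset());
            convert_charset(b, unicode_charset());
        }
    }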


IW: Force a conversion to a different character set
TP: Does it. An exception or warning may be thrown if the conversion is not lossless.


Please note that in most cases parrot deals with string data as *strings* in S registers (or hiding behind PMCs), not as integers in I registers (even though we treat strings as a series of abstract integer code points). This is because even something as simple as "give me character 5" may return a series of code points if character 5 is a combining character sequence. We may (possibly, but possibly not) get a bit dirtier for the regex code for speed reasons, but we'll see about that.
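
So a hypothetical "give me character N" helper looks more like this than like an integer fetch (all names illustrative):

    /* Sketch: asking for character N hands back a (short) string, not an
     * integer, since that character may be one code point or several
     * (base plus combining marks). */
    #include <stddef.h>

    struct parrot_string;

    extern size_t char_to_codepoint_offset(const struct parrot_string *s,
                                           size_t char_index);
    extern size_t char_extent(const struct parrot_string *s,
                              size_t codepoint_offset);
    extern struct parrot_string *substr_codepoints(const struct parrot_string *s,
                                                   size_t codepoint_offset,
                                                   size_t codepoint_count);

    struct parrot_string *
    get_character(const struct parrot_string *s, size_t char_index)
    {
        size_t start = char_to_codepoint_offset(s, char_index);
        size_t len   = char_extent(s, start);     /* >= 1 code point       */
        return substr_codepoints(s, start, len);  /* result is a string    */
    }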

Also note that some languages, such as Perl 6, have a more restricted view of things. That's fine; we don't really care much as long as everything they need is provided. So the fact that Larry's mandated the Ux levels is fine, but since they're a (possibly excessively) restricted subset of what we're going to do, we can, and in fact should (as they're more restrictive), ignore them for our purposes. The same goes for other languages with similar restrictions.

Finally, note that in general the actual character set or language of a string becomes completely irrelevant, so there isn't any loss in abstracting things--properly supporting Unicode means abstracting the heck out of so much stuff that supporting multiple encodings and character sets is just a matter of switching out table pointers, and as such not particularly a big deal.

Yes, this does mean that some of the recent ICU integration's going to be moved back some, and it means that string data's more complex than you might want it to be, but it already is, so we deal.

This all is not, as of yet, entirely non-negotiable, though I've yet to get a convincing argument for change.
--
Dan


--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk