Strings. Finally.

Dan Sugalski Mon, 14 Jun 2004 13:55:34 -0700

The official, 1.0, final version, modulo a more correct name for 'grapheme', or spelling/grammar errors.

Do please note that whatever objection you may have to this has at least three people who disagree differently, and one or more (who aren't me) who agree with what you disagree with. Also note that I'm not entirely happy with this either.

Consider it an exercise in group coping--we will all deal with it and make do. All complaints, *including* mine, shall be summarily binned, with extreme prejudice. And yes, this means I won't be complaining about Unicode any more.

++++++++++++++++++++++Cut Here++++++++++++++++++++++++
Strings, the final design document

Requirements
============

* Efficiency - The system must do the absolute minimum amount of work
  to get the job done

* Correctness - The job that's done must actually be right

* Upgradeability - This stuff's all going to change again in five years
  so we really don't want to have to do it over again.

* Flexibility - Since, unfortunately, no one way of looking at
  strings is going to be right for everyone

Realities
=========

* There are a lot of different ways of representing text. Many of
  them annoying, some of them wildly incompatible, none of them
  wrong.

* We don't get to make the call on what is right or wrong

* Some of the languages we support don't do Unicode, or do Unicode
  and other things (including perl 5 and Ruby)

Desires
=======

* We want to make it easily possible to do the right thing with string
  data

* We want all the troublesome stuff to be as invisible as possible

* We want to make it look like everyone's got what they want without
  actually doing it when we don't have to


With that list in mind, here's parrot's solution. Please note that
the *only* thing up for discussion is a more correct label for
'grapheme'. It is, otherwise, the final external design.

Definitions
===========

BYTE - 8 bits 'o data

CODE POINT - A 32-bit integer that represents a single thing in a
             character set

ENCODING - How code points are mapped to bytes, and vice versa

CHARACTER SET - Contains meta-information about code points. This
                includes both the meaning of individual code points
                (65 is capital A, 776 is a combining diaeresis) as
                well as a set of categorizations of code
                points (alpha, numeric, whitespace, punctuation, and
                so on), and a sorting order.

GRAPHEME - One or more code points which make up a single real
           entity. The "oe" (I'm stuck with ASCII here, that should
           really be an o with two dots over it) in Leo's last name
           is, in the unicode character set, a single character made
           of two code points, 111 (lowercase o) and 776 (combining
           diaeresis). Graphemes can *not* be legitimately
           decomposed into individual code points in most cases.
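
As a rough illustration of the code point versus grapheme distinction, here is a minimal Python sketch. It is not Parrot code: the helper name `split_graphemes` is made up for this example, and its rule (one base code point plus any combining marks that follow it) is a deliberate simplification of real grapheme segmentation.

```python
import unicodedata

def split_graphemes(text):
    """Group each base code point with the combining marks that follow it.

    Simplifying assumption: a grapheme is one base code point plus any
    trailing code points whose combining class is non-zero.
    """
    graphemes = []
    for ch in text:
        if graphemes and unicodedata.combining(ch) != 0:
            graphemes[-1] += ch   # attach the mark to the previous base
        else:
            graphemes.append(ch)  # start a new grapheme
    return graphemes

text = "o\u0308"                  # lowercase o + combining diaeresis
code_points = [ord(ch) for ch in text]
graphemes = split_graphemes(text)
```

Here `code_points` holds two integers (111 and 776) while `graphemes` holds a single entry, which is the mismatch the definition above is warning about.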

Important note
==============

This document is completely language-insensitive--that is, there's no
language attached to any particular piece of data. Collation and
casing rules are done based on a single global setting that is
unconditionally applied in all cases. Setting and querying those
rules is beyond the scope of this document.

Conceptually
============

The smallest unit of text that Parrot will process is the string,
something that can be put in an S register. These strings have the
following properties:

*) They have an encoding
*) They have a character set
*) They have a taint status

The above things are independent of the view of the string presented
to bytecode programs--these are metadata elements that describe the
contents of the string as they actually exist, rather than as they
are presented.

Internally parrot is capable of maintaining strings in several
different basic encodings (8-bit, 16-bit, and 32-bit integer, as well
as UTF-8) and may load other encodings on the fly as needed. Parrot is
also capable of maintaining strings in many different character sets
(ASCII, EBCDIC, Unicode, Latin-n, etc) which are also dynamically
loadable. Finally Parrot is capable of maintaining strings in many
different languages, which also may be loaded on the fly.

This is done for maximum efficiency, regardless of the view of the
data presented to the bytecode programs. Conversion to a different
format may be done if needed to properly express the semantics of the
program, but will not be done if not needed.

For example, consider the following:

  use Unicode;
  open FOO, "foo.txt", :charset(latin-3);
  open BAR, "bar.txt", :charset(big5);
  $filehandle = 0;
  while (<>) {
    if ($filehandle++) {
      print FOO $_;
    } else {
      print BAR $_;
    }
    $filehandle %= 2;
  }

Relatively simple, the program reads from the input filehandle and
splits the data, line by line, between two output files. The two
output files have different requirements -- FOO gets data in Latin-3,
while BAR gets it in Big5. The "use Unicode;" thing at the top's a
hand-wavey way of asserting that we want full Unicode text semantics.

Even so, there's no actual reason in this program to convert to
Unicode at all. If the input file is either Latin-3 or Big5, half of
the lines read don't have to be converted to anything. If the input
file's a proper subset of both (like, US ASCII) then none of the
lines read in need any conversion at all.

If Parrot forced all input data to be converted to Unicode internally
then this program would potentially have some significant overhead,
depending on the type of the input file. Given the output, the input
is likely either Latin-3 or Big5, either of which needs some
conversion to get turned into Unicode, while Unicode is guaranteed to
need some conversion for proper output to both files.

Synthesized code points
=======================

Parrot provides code points for all graphemes, even for those
character sets/encodings which don't inherently do so. Most sets that
have variable-length encodings use an escape sequence scheme--the
value of the first byte in a character determines whether the
grapheme is a one or more byte sequence. When parrot turns these into
code points it does it by building up the final value. The first byte
is put in the low 8 bits of the integer. If there's a second byte in
the sequence the current value is shifted left 8 bits and the new byte
is stuffed in the low 8 bits. If there's a third byte in the sequence
everything is shifted left again 8 bits and that third byte is stuffed
in the bottom, and so on.

For example, in Shift-JIS, if the first byte is in the range
0x21-0x7E or 0xA1-0xDF the grapheme is a single byte. If the first
byte is in the range 0x81-0x9F or 0xE0-0xEF the grapheme takes two
bytes, with the first byte determining which table the second byte
indexes into. The roman grapheme A is represented by a single byte
0x41, while the Japanese hiragana KA is represented by the byte
sequence 0x82 0xA9. When parrot turns this into code points, it
becomes two integers, 0x00000041 and 0x000082A9. (Though it could
represent them as 16-bit integers, since no character takes three or
more bytes)
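
The shift-and-stuff scheme can be sketched in a few lines of Python. This is an illustration only, not Parrot's implementation: the function names are invented for the example, the lead-byte ranges are the ones quoted above, and any byte that isn't a lead byte is simply treated as a one-byte grapheme.

```python
def is_lead_byte(byte):
    """True if this byte starts a two-byte sequence (ranges from the text)."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xEF

def synthesize_code_points(data):
    """Turn a Shift-JIS byte string into fixed-width integers."""
    code_points = []
    i = 0
    while i < len(data):
        value = data[i]
        i += 1
        if is_lead_byte(value):
            if i >= len(data):
                raise ValueError("truncated two-byte sequence")
            # Shift what we have left 8 bits, stuff the new byte in the bottom.
            value = (value << 8) | data[i]
            i += 1
        code_points.append(value)
    return code_points

def to_byte_stream(code_points):
    """Reverse the transformation: integers back to the original bytes."""
    out = bytearray()
    for cp in code_points:
        if cp > 0xFF:
            out.append(cp >> 8)   # lead byte
        out.append(cp & 0xFF)     # trail byte, or the only byte
    return bytes(out)

raw = bytes([0x41, 0x82, 0xA9])   # roman A, then hiragana KA
points = synthesize_code_points(raw)
```

With the bytes from the example above, `points` comes out as the two integers 0x41 and 0x82A9, and `to_byte_stream(points)` gives back the original three bytes.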

While this is somewhat unconventional, it makes the text easy to
process internally as fixed-width integers, is trivially transformed
back into a byte stream, and trivially turned from a byte stream into
integers in the first place. It also has the advantage of making what
was a variable-width encoding (some of which make it difficult or
impossible to tell, if you pick a byte at a random spot in the byte
stream, whether you're in the middle of a grapheme or not) into a
fixed-width encoding. As such it makes a reasonably pleasant way to
manipulate this sort of text.

Conversion Rules
================

There are two types of conversions: from one thing (encoding,
charset, or language) to a thing of a similar type, or to a thing of a
different type.

Similar here means a thing where the conversion is lossless or
accepted as good enough to have no semantic loss--for example
converting US ASCII to most character sets, or pretty much any
character set to Unicode.  Different here means a thing where the
conversion is *not* guaranteed lossless--for example converting from
Shift-JIS to US ASCII or from Unicode to Latin-1.

Conversion lossiness is gauged either as a potential loss (where data
*may* be lost) or actual loss (where data, after conversion, *has*
been lost). While, for example, Big5 and Shift-JIS aren't
interchangeable in general so there is potential loss, they both have
US ASCII as a subset so it's possible that the conversion won't
actually lose any information.

Current interpreter settings determine when an exception or warning
is thrown. Some languages may deem it an error to implicitly shift to
an encoding where data may be lost and throw an error any time that
happens, others may defer the error until actual data loss occurs,
and still others may decide that data loss is fine, since if you were
worried about it in the first place you would've done something about
it.
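
To make the potential/actual distinction concrete, here is a small Python sketch of the three policies described above. Everything in it is an assumption made for illustration: the `LossyConversion` class, the `convert` function, the policy names, and the idea that only the UTF encodings count as "similar" targets. Python's own codecs stand in for Parrot's converters.

```python
# Assumption for this sketch: these targets can hold anything, so
# converting to them is never even a potential loss.
LOSSLESS_TARGETS = {"utf-8", "utf-16", "utf-32"}

class LossyConversion(Exception):
    """Raised when the chosen policy objects to a conversion."""

def convert(text, target, policy="actual"):
    """Encode text, honouring one of three hypothetical policies:

    "potential" - complain if data *may* be lost
    "actual"    - complain only if data *has* been lost
    "ignore"    - never complain; substitute what can't be represented
    """
    if policy == "potential" and target not in LOSSLESS_TARGETS:
        raise LossyConversion(f"conversion to {target} may lose data")
    try:
        return text.encode(target)
    except UnicodeEncodeError as exc:
        if policy == "ignore":
            return text.encode(target, errors="replace")
        raise LossyConversion(f"conversion to {target} lost data") from exc
```

Under the "potential" policy, `convert("abc", "ascii", "potential")` is refused even though nothing would be lost, while the "actual" policy lets it through and only objects to something like `"h\u00e9"`.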

Conversions are not required nor guaranteed to be symmetric. Just
about everything can shift to Unicode, and US ASCII can shift to just
about anything, but the converse is not true.

Since maintaining a full set of conversions is untenable, Parrot
declares that, by definition, all sets can pivot through
Unicode. Unicode pivoting is considered a potential loss of data, so
if the interpreter is set to warn or throw exceptions on potential
loss it will do so, even if the conversion is actually OK. (In which
case someone had better note that somewhere) It's perfectly acceptable
(and, in fact, encouraged) for a set to declare that it can explicitly
pivot to another set, with the actual internal code first going
through Unicode.
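
A minimal sketch of pivoting, again in Python rather than Parrot. The `DIRECT` table and the `transcode_bytes` function are hypothetical names for this example; Python's `str` type plays the role of the Unicode pivot, since every Python codec decodes to it and encodes from it.

```python
# Hypothetical registry of direct converters, keyed by (source, target).
# It is empty here, so every conversion falls back to the Unicode pivot.
DIRECT = {}

def transcode_bytes(data, source, target):
    """Convert bytes between two sets, pivoting through Unicode if needed."""
    direct = DIRECT.get((source, target))
    if direct is not None:
        return direct(data)
    pivot = data.decode(source)   # source set -> Unicode
    return pivot.encode(target)   # Unicode -> target set

latin1_bytes = b"caf\xe9"
utf8_bytes = transcode_bytes(latin1_bytes, "latin-1", "utf-8")
```

A set that declares a direct route would add an entry to the table; callers don't need to know which path was taken.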

Internals
=========

Internally all strings are tagged with an encoding, a charset, and a
taint status. This is the minimum amount of information that can be
reasonably kept for a string without losing enough information to
damage it if the data is passed into a subroutine which expects a
string parameter rather than a full-blown PMC.

Tainting status is the simplest thing here, maintainable with a single
bit in the flags word for the string. We have to maintain this so that
the sequence:

   set S0, P0
   set P0, S0

doesn't lose the taint status of the data in P0, as well as so this:

  set S0, P0
  some_sub(S0)

passes in a properly tainted string to the some_sub subroutine. We're
encouraging code to use values of the lowest possible type, but we
don't want to be sacrificing safety for it.

Encoding needs to be attached to each string so we have some idea of
how to turn the bytes in the string's buffer into actual code
points. Since we defer transforming the string data until we actually
need to use it, regardless of what logical structure we may think the
string has, we still need to work on the actual structure it has.
This also allows easier processing of data in an encoding different
than whatever parrot may take as 'normal', if it ever does. Each
character set will have a preferred encoding, but people are going to
want to shift encodings around at times. (Especially the various utf-N
encodings)

Character set is attached so we can tell what to do with the code
points that come from the encoding and how to classify them. While we
prefer Unicode, that doesn't mean we're actually *in* unicode
yet. Also, since the possibility exists that we may at least have two
different character sets (either Unicode or binary, even if we declare
there are no others) it's less error-prone to unconditionally use the
set information hanging off the string itself.
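
Roughly, the per-string metadata looks like the following Python sketch. The class and function names are invented for illustration, and `Box` is only a stand-in for a PMC. Note that Python's codecs fold encoding and character set together, so the charset here is carried purely as a label.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedString:
    """Hypothetical model of a string as it sits in an S register."""
    buffer: bytes           # the raw bytes, left untransformed
    encoding: str           # how to turn the bytes into code points
    charset: str            # what those code points mean (label only here)
    tainted: bool = False   # a single flag bit in the real thing

    def code_points(self):
        # Decoding is deferred until somebody actually asks for it.
        return [ord(ch) for ch in self.buffer.decode(self.encoding)]

class Box:
    """Stand-in for a PMC that holds a string value."""
    def __init__(self, value):
        self.value = value

def set_string_from_box(box):
    return box.value        # like: set S0, P0

def set_box_from_string(box, string):
    box.value = string      # like: set P0, S0

p0 = Box(TaggedString(b"rm -rf /", "ascii", "ascii", tainted=True))
s0 = set_string_from_box(p0)
set_box_from_string(p0, s0)
```

Because the taint flag travels with the string rather than with the box, the round trip leaves `p0.value.tainted` set, and a subroutine handed `s0` sees the flag too.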

Core functionality
==================

The following functions need to be performed by the core:

*) Transform encodings
*) Transform character sets
*) Get/set byte, code point, and grapheme from a string
*) Get/set substring
*) Get length in bytes, code points, and graphemes
*) Get/Set encoding
*) Get/Set character set
*) Flatten to and thaw from a binary string
*) Upcase, downcase, and titlecase

These are all unary operations. While binary operations are necessary
for actual use, we'll deal with them after we get basic string
manipulation working.

Opcodes
=======

The following ops are proposed. Note that for many of them there is a
string-native version and a Unicode version--this is noted by a
(u). For Unicode strings these will behave identically, while for
strings that aren't in unicode they perform the operation and
translate to or from unicode as necessary.

getbyte          Ix, Sy, Iz
(u)getcodepoint  Ix, Sy, Iz
(u)getgrapheme   Sx, Sy, Iz

Get the byte, codepoint, or grapheme requested. Destination is either
an integer (representing the byte or codepoint) or a string. Sy is the
source string, Iz is the offset in bytes, code points, or graphemes
from the beginning of the string.

(u)getstring     Sw, Sx, Iy, Iz

This is substr, with the destination guaranteed to be in Unicode for
the (u) case.

setbyte          Sx, Iy, Iz
(u)setcodepoint  Sx, Iy, Iz
(u)setgrapheme   Sx, Sy, Iz

Sets the byte, code point, or grapheme at offset Z in source string X
to the value in Y. Note that in the unicode case the source is taken
to be a unicode code point or grapheme and translated to the type of
the destination string. These opcodes may throw an exception if the
resulting destination string is illegal (for example if the
destination is a unicode string with illegal combining character
construction, or in the byte case if the resulting buffer is un-decodable)

(u)setstring     Sw, Sx, Iy, Iz

This is lvalue substr--the graphemes at offset Y, count Z (NB
*graphemes*, not code points) are replaced by the string X. In the
unicode case the string is taken to be unicode and translated to the
type of the destination string.

encoding Ix, Sy
charset  Ix, Sy

Returns the encoding or character set of Y.

encodingname Sx, Iy
charsetname  Sx, Iy

Returns the name of the encoding or character set that corresponds to
the internal value Y. (As returned by the encoding and charset ops)

findencoding Ix, Sy
findcharset  Ix, Sy

Find the internal value for the encoding or character set named Y.

bytelength      Ix, Sy
codepointlength Ix, Sy
graphemelength Ix, Sy

Return the length of Y in bytes, code points, or graphemes. Length is
actual length, and as such may vary for otherwise identical
strings. (This is especially true for strings that change encoding, as
lengths can vary wildly between a UTF-8 and UTF-32 version of the same
unicode string)
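
To see how far the three lengths can drift apart, here is a short Python sketch. The function names merely mirror the proposed ops and are not real Parrot bindings, and the grapheme count uses a simplified rule (count every code point that is not a combining mark).

```python
import unicodedata

def byte_length(text, encoding):
    """Length in bytes once the text is stored in the given encoding."""
    return len(text.encode(encoding))

def code_point_length(text):
    """Length in code points (Python strings are code point sequences)."""
    return len(text)

def grapheme_length(text):
    """Simplified grapheme count: skip combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

text = "o\u0308k"   # o + combining diaeresis, then k
lengths = {
    "utf-8 bytes": byte_length(text, "utf-8"),
    "utf-32 bytes": byte_length(text, "utf-32-le"),  # -le avoids a BOM
    "code points": code_point_length(text),
    "graphemes": grapheme_length(text),
}
```

The same three code points take 4 bytes in UTF-8 and 12 in UTF-32, and make up only 2 graphemes.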

transcode Sx, Iy
transset  Sx, Iy

Change the string to have the specified encoding, language, or
character set. Done in place

transcode Sx, Sy, Iz
transset  Sx, Sy, Iz

Generate a new version of Y with the encoding or character set Z.

tounicode Sx
tounicode Sx, Sy

Change the string to unicode. The one arg version does it in place,
the two arg version generates a new string.

upcase    Sx
upcase    Sx, Sy
downcase  Sx
downcase  Sx, Sy
titlecase Sx
titlecase Sx, Sy

Make the string all uppercase, all lower case, or titlecase the first
grapheme. The two-arg versions generate a new string, the one arg
version does it in place.

decompose Sx, Sy

Take the string in Y and return a version in X which is a flat byte
string with no language, character set, or encoding. (or, rather, the
charset none, and encoding 8-bit binary)

compose Sw, Ix, Iy

Take the flattened binary string W and mark it as having the encoding
X, character set Y. This may throw an exception if the string doesn't
meet the requirements of the charset, or encoding.

compose Sv, Sw, Ix, Iy

As above, only a new string is generated and the original left alone.
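
The flatten/thaw pair can be approximated in Python as below. This is a sketch under assumptions: the function names are borrowed from the ops for readability, a Python codec name stands in for the encoding and charset arguments together, and validation is simply "does it decode".

```python
def decompose(text, encoding):
    """Flatten to a plain byte string with no metadata attached."""
    return text.encode(encoding)

def compose(raw, encoding):
    """Re-tag a flat byte string, refusing bytes that don't fit."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        raise ValueError(f"bytes are not valid {encoding}") from exc

flat = decompose("o\u0308", "utf-8")
thawed = compose(flat, "utf-8")
```

Composing `b"\xff"` as UTF-8 fails, which corresponds to the exception the op may throw when the buffer doesn't meet the requirements of the requested encoding.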

Exceptions
==========

Here's a list of the exceptions that will be thrown if the string
subsystem comes across things it's not happy about. All of these
exceptions are optional, and may be overridden by interpreter
settings. Additionally, some conversions are deemed less dangerous
than others, and as such there are two different types of conversion
(similar and dissimilar) rather than just one. These exceptions may
also be thrown either because of potential problems (where something
might happen) or actual problems (where something did happen).

* CHARSET_MISMATCH - thrown whenever a binary operation is done on
  strings of different character sets.

* LOSSY_CONVERSION - Thrown whenever a conversion would lose
  information. This includes getting a plain string from a PMC which
  has segmented string data in it. (This would be a PMC which has
  some data in Unicode, EBCDIC, and RAD-50, for example, or whose
  contents had different languages attached to different parts of the
  string data)

* DECOMPOSITION_ERROR - Thrown whenever you try and act on part of a
  multi-code point grapheme. This includes doing an ord() on a
  string where the grapheme you're ord'ing is made up of two or more
  code points.
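
As an illustration of the DECOMPOSITION_ERROR case, here is a hedged Python sketch of an ord() that refuses multi-code point graphemes. `DecompositionError` and `grapheme_ord` are hypothetical names, and the caller is assumed to pass in exactly one grapheme.

```python
class DecompositionError(Exception):
    """Stand-in for the DECOMPOSITION_ERROR exception."""

def grapheme_ord(grapheme):
    """Return the code point of a grapheme, if it has exactly one."""
    if len(grapheme) != 1:
        raise DecompositionError(
            f"grapheme has {len(grapheme)} code points, expected exactly 1"
        )
    return ord(grapheme)

capital_a = grapheme_ord("A")
```

Calling `grapheme_ord("o\u0308")` raises instead of quietly returning 111 and dropping the diaeresis.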

--
                                Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
