String rationale

Dan Sugalski Thu, 25 Oct 2001 08:48:45 -0700

'Kay, here's the string background info I promised. If things are missing 
or unclear let me know and I'll fix it up until it is.



==================Cut here with a very sharp knife===============
=head1 TITLE

A Parrot string backgrounder

=head1 Overview

Strings in Parrot are compartmentalized, the same way so much else
in Parrot is compartmentalized. There's no single 'blessed' string
encoding. The closest we come is Unicode, and only as an encoding of
last resort. (Unicode's not a good interchange format, as it loses
information.)

=head2 From the Outside

On the outside, the interpreter considers strings to be a sort of
black box. The only bits of the interpreter that much care about the
string data are the regex engine parts, and those only operate on
fixed-sized data.

The interpreter can only peek inside a string if that string is of
fixed length, and the interpreter doesn't actually care about the
character set the data is in. All character sets must provide a way to
transcode to Unicode, and all character encodings must provide a way
to turn their characters into fixed-sized entities. (The size may be
8, 16, or 32 bits, as needed for the character set.)

Character sets may provide a way to transcode to non-Unicode sets, for
example from EBCDIC to ASCII, but this is optional. If none is
provided, a transcoding from one set to another will use Unicode as an
intermediate form, complete with potential data loss.
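
The lookup order described above can be sketched in C. This is only an illustration, not Parrot's code: the C<charset> structure, the C<from_unicode> hook, the toy EBCDIC subset (space and digits only), and every function name here are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define UNICODE_REPLACEMENT 0xFFFDu  /* U+FFFD, used here for "no mapping" */

/* Hypothetical charset descriptor; these names are not Parrot's. */
typedef struct charset {
    const char *name;
    uint32_t (*to_unicode)(uint32_t ch);    /* mandatory */
    uint32_t (*from_unicode)(uint32_t cp);  /* assumed, for the fallback path */
    const struct charset *direct_target;    /* optional, may be NULL */
    uint32_t (*to_direct)(uint32_t ch);     /* optional, may be NULL */
} charset;

/* ASCII: values below 0x80 map to the same Unicode code point. */
static uint32_t ascii_to_unicode(uint32_t ch)
{
    return ch < 0x80 ? ch : UNICODE_REPLACEMENT;
}

static uint32_t ascii_from_unicode(uint32_t cp)
{
    return cp < 0x80 ? cp : 0x3F;           /* 0x3F is ASCII '?' */
}

/* Toy EBCDIC subset: only the space (0x40) and the digits (0xF0-0xF9). */
static uint32_t ebcdic_to_unicode(uint32_t ch)
{
    if (ch == 0x40) return 0x20;
    if (ch >= 0xF0 && ch <= 0xF9) return 0x30 + (ch - 0xF0);
    return UNICODE_REPLACEMENT;
}

static uint32_t ebcdic_from_unicode(uint32_t cp)
{
    if (cp == 0x20) return 0x40;
    if (cp >= 0x30 && cp <= 0x39) return 0xF0 + (cp - 0x30);
    return 0x6F;                            /* EBCDIC '?': data is lost here */
}

/* The optional shortcut: EBCDIC straight to ASCII, skipping Unicode. */
static uint32_t ebcdic_to_ascii(uint32_t ch)
{
    if (ch == 0x40) return 0x20;
    if (ch >= 0xF0 && ch <= 0xF9) return 0x30 + (ch - 0xF0);
    return 0x3F;
}

static const charset ascii_set = {
    "ASCII", ascii_to_unicode, ascii_from_unicode, NULL, NULL
};

static const charset toy_ebcdic_set = {
    "toy-EBCDIC", ebcdic_to_unicode, ebcdic_from_unicode,
    &ascii_set, ebcdic_to_ascii
};

/* Use the direct converter when one exists, else go through Unicode. */
uint32_t transcode_char(const charset *from, const charset *to, uint32_t ch)
{
    if (from == to)
        return ch;
    if (from->direct_target == to && from->to_direct != NULL)
        return from->to_direct(ch);
    return to->from_unicode(from->to_unicode(ch));
}
```

In this sketch the toy EBCDIC set has a direct converter to ASCII, while ASCII has none, so the ASCII-to-EBCDIC direction takes the Unicode detour. An ASCII 'A' (0x41) has no slot in the toy set and comes out as a question mark, which is the sort of data loss mentioned above.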

All character sets must provide the character lists the regular
expression engine needs for the base character classes (space, word,
and digit characters). This permits the regular expression code to
operate on the contents of a string without needing to know its actual
character set.
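
One way to picture this is a small table of class tests that each character set fills in. The sketch below is an assumption-laden illustration: the text above speaks of character lists, while this uses predicate functions for brevity, and C<charset_classes>, C<count_leading_digits>, and the simplified EBCDIC ranges are all made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-charset class table; the names are illustrative. */
typedef struct charset_classes {
    int (*is_space)(uint32_t ch);
    int (*is_word)(uint32_t ch);
    int (*is_digit)(uint32_t ch);
} charset_classes;

static int in_range(uint32_t ch, uint32_t lo, uint32_t hi)
{
    return ch >= lo && ch <= hi;
}

/* ASCII classes. */
static int ascii_is_space(uint32_t ch)
{
    return ch == 0x20 || in_range(ch, 0x09, 0x0D);
}

static int ascii_is_digit(uint32_t ch) { return in_range(ch, 0x30, 0x39); }

static int ascii_is_word(uint32_t ch)
{
    return ascii_is_digit(ch) || ch == 0x5F ||
           in_range(ch, 0x41, 0x5A) || in_range(ch, 0x61, 0x7A);
}

/* EBCDIC classes (simplified: only 0x40 counts as space). */
static int ebcdic_is_space(uint32_t ch) { return ch == 0x40; }

static int ebcdic_is_digit(uint32_t ch) { return in_range(ch, 0xF0, 0xF9); }

static int ebcdic_is_word(uint32_t ch)
{
    return ebcdic_is_digit(ch) || ch == 0x6D ||          /* underscore */
           in_range(ch, 0x81, 0x89) || in_range(ch, 0x91, 0x99) ||
           in_range(ch, 0xA2, 0xA9) || in_range(ch, 0xC1, 0xC9) ||
           in_range(ch, 0xD1, 0xD9) || in_range(ch, 0xE2, 0xE9);
}

static const charset_classes ascii_classes = {
    ascii_is_space, ascii_is_word, ascii_is_digit
};

static const charset_classes ebcdic_classes = {
    ebcdic_is_space, ebcdic_is_word, ebcdic_is_digit
};

/* Both buffers hold "42 x" as fixed-width 32-bit units. */
static const uint32_t ascii_text[]  = { 0x34, 0x32, 0x20, 0x78 };
static const uint32_t ebcdic_text[] = { 0xF4, 0xF2, 0x40, 0xA7 };

/* A stand-in for matching \d+ at the start of a string. The code never
   looks at which character set it was handed, only at the class table. */
size_t count_leading_digits(const charset_classes *cs,
                            const uint32_t *chars, size_t n)
{
    size_t i = 0;
    while (i < n && cs->is_digit(chars[i]))
        i++;
    return i;
}
```

The same C<count_leading_digits> call finds two digits in either buffer, even though the underlying values differ.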

=head2 From the Inside

=head2 Technical details

The base string structure looks like:

   struct parrot_string {
     void *bufstart;
     INTVAL buflen;
     INTVAL bufused;
     INTVAL flags;
     INTVAL strlen;
     STRING_VTABLE* encoding;
     INTVAL type;
     INTVAL language;
   };


=head2 Fields

=over 4

=item bufstart

Where the string buffer starts

=item buflen

How big the buffer is

=item bufused

How much of the buffer's used

=item flags

A variety of flags. The low 16 bits are reserved for Parrot; the rest
are free for the string encoding library to use.

=item strlen

How long the string is in code points. (Note that, for encodings that
are more than 8 bits per code point, or of variable length, this will
B<not> be the same as the buffer used.)

=item encoding

Pointer to the library that handles the string encoding. Encoding is
basically how the stream of bytes pointed to by C<bufstart> can be
turned into a stream of 32-bit code points. Examples include UTF-8, Big
5, or Shift JIS. Unicode, ASCII, or EBCDIC are B<not> encodings.

=item type

The character set or type of the data encoded in the buffer. This
includes things like ASCII, EBCDIC, Unicode, Chinese Traditional,
Chinese Simplified, or Shift-JIS. (And yes, I know the latter's a
combination of type and encoding. I'll update the doc as soon as I can
reasonably separate the two.)

=item language

The language the string is in. This is essential for proper sorting,
if a sort function wants to be language-aware. Just an encoding/type
is insufficient for proper sorting--for example knowing a string is
UTF-32/Unicode doesn't tell you how the data should be ordered. This
is especially important for those languages that overlap in the
Unicode code space. Japanese and Chinese, for example, share many of
the Unicode code points but sort those code points differently.

=back
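
To make the difference between the three size fields concrete, here is a small sketch using UTF-8. It is not Parrot code: C<toy_string> is a trimmed stand-in for the structure above, C<INTVAL> is assumed to be a plain C<long>, sizes are assumed to be counted in bytes, and the counting routine assumes well-formed UTF-8.

```c
#include <stddef.h>

typedef long INTVAL;  /* assumption: a stand-in for Parrot's integer type */

/* Trimmed, hypothetical mirror of the structure (size fields only). */
struct toy_string {
    const void *bufstart;
    INTVAL buflen;
    INTVAL bufused;
    INTVAL strlen;
};

/* "hello" with an e-acute as its second letter, in UTF-8: six bytes
   sitting in a sixteen-byte buffer. */
static const unsigned char demo_bytes[16] = {
    0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F
};

/* Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
   Assumes the input is well-formed. */
static INTVAL utf8_code_points(const unsigned char *buf, INTVAL nbytes)
{
    INTVAL count = 0;
    INTVAL i;
    for (i = 0; i < nbytes; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Fill in the size fields for a buffer that holds UTF-8 data. */
static struct toy_string describe_utf8(const void *buf, INTVAL buflen,
                                       INTVAL bufused)
{
    struct toy_string s;
    s.bufstart = buf;
    s.buflen   = buflen;
    s.bufused  = bufused;
    s.strlen   = utf8_code_points(buf, bufused);
    return s;
}
```

Calling C<describe_utf8(demo_bytes, 16, 6)> gives a buffer length of 16, 6 bytes used, and a string length of 5, since the accented letter takes two bytes but is one code point.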

Libraries for processing character sets and encodings are shareable
libraries, and may be loaded on demand. They are looked up and
referenced by name. An identifying number is given to them at load
time and shouldn't be used outside the currently running
process. (EBCDIC might be character set 3 in one run and set 7 in
another.)
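
A minimal sketch of the name-to-number idea follows. It leaves out the actual loading of a shared library and only shows why the numbers depend on load order; C<charset_id>, C<charset_name>, and the fixed-size table are hypothetical and not part of Parrot.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LIBS 16

/* Names of the libraries "loaded" so far, in load order. The table keeps
   the caller's pointer, so the name must outlive the table. */
static const char *loaded_names[MAX_LIBS];
static int loaded_count = 0;

/* Return the id for a library name, registering it on first use.
   Returns -1 if the table is full. */
int charset_id(const char *name)
{
    int i;
    for (i = 0; i < loaded_count; i++) {
        if (strcmp(loaded_names[i], name) == 0)
            return i;
    }
    if (loaded_count == MAX_LIBS)
        return -1;
    loaded_names[loaded_count] = name;
    return loaded_count++;
}

/* Map an id back to its name, or NULL if the id is unknown. */
const char *charset_name(int id)
{
    if (id < 0 || id >= loaded_count)
        return NULL;
    return loaded_names[id];
}
```

Because ids are handed out in the order names are first asked for, a run that requests EBCDIC before ASCII gets different numbers from a run that does it the other way round. Anything written to disk would therefore have to carry the name rather than the number.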

The native encoding and character set are I<never> considered a 'real'
encoding or character set. They just specify what the default is if
nothing else is specified; when bytecode is frozen to disk, the
actual encoding or set name will be used instead.

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
