Dan suggested that I make this document public.

I've been reluctant to do so, for several reasons, the most important
being that I'm far, FAR, behind in the current Parrot developments.

Therefore, this document will probably rudely stomp right over any
recent string, encoding and regular expression work that has been
going on.  Its idea of vtables is just so much handwaving; I don't
really know how vtables work or how they are currently implemented
in Parrot.

However, I have spent some serious thought on this, based on the
appropriate amount of fun I've had lately with the Unicode in the
upcoming Perl 5.8, and some familiarity with the Unicode standard.

The most important message is this: give up on 8-bit bytes, already.
Time to move on, chop chop.

The document itself is a collection (what we experts technically call
"a mess") of theory and practice: character model and statements about
how things should be, and how things should probably be implemented.
I've asked Larry about whether the ICU license would be compatible
with whatever Parrot will have, but he's currently cruising...

The bad (or good?) news is that I don't have the time to work
on this design document much more, maintain or develop it,
and certainly not implement any of it (another reason why I
have been reluctant to publish this).  But that's how it is.

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
=head1 TITLE

Parrot String Handling

=head1 VERSION

    $Id: parrotstring.pod,v 1.16 2002/01/18 03:57:07 jhi Exp $

=head2 Current

    Maintainer: Some Body
    Class: Internals
    PDD Number: X
    Version: 1.0
    Status: Developing
    Last Modified: 
    PDD Format: 1
    Language: English

=head2 History

None.  First version.  (Jarkko Hietaniemi)

=head1 CHANGES

None.  First version.

=head1 ABSTRACT

This PDD specifies how Parrot should handle characters and text,
the character encoding model, and the basic operations.

=head1 DESCRIPTION

Since discussion of character sets, encodings, and the like quite
often confuses people considerably, let's first define some terms for
our character encoding model.  The terminology follows the Unicode
Technical Report #17, I<Character Encoding Model>,
http://www.unicode.org/unicode/reports/tr17/ .  The character
encoding model consists of five layers:

=head2 Terminology

=over 4

=item I<ACR>

I<Abstract Character Repertoire> - the (unordered) set of characters
to be encoded.  Unicode defines one (open-ended) repertoire, the ISO
8859 are examples of (closed) repertoires.

=item I<CCS>

I<Coded Character Set> - a mapping from an abstract character
repertoire to a set of non-negative integers known as I<code points>.
This is how the Unicode C<LATIN CAPITAL LETTER A> is mapped to U+0041.
Alternative names for a CCS are a I<character encoding>, a I<coded
character repertoire>, a I<character set definition>, or a I<code page>.
Examples include Unicode, ISO 8859, the Windows code pages, and the
JIS X 0208.  The code point space (from zero to some positive number)
can be flat or complex (some ranges may be unused or forbidden).

=item I<CEF>

I<Character Encoding Form> - a mapping from a set of non-negative
integers (a CCS) to a set of sequences of particular I<code units> of
some specified (computer-architecture imposed) width, such as bytes.

CEFs can be fixed-width or variable-width.  Note that (especially
historically) in many cases CCS and CEF have used the same (often
byte) fixed width, so they have been practically identical.  But don't
let that fool you.

A byte sequence can be either I<assigned> (it represents a valid
and assigned character in the code point space), I<unassigned>
(it represents a valid but unused code point), or I<illegal>
(for some reason forbidden).

Examples of fixed-width CEFs include 7-bit, 8-bit, 16-bit, and 32-bit.
Note that these are independent of any particular CCS.  8-bit CEF can
be used for encoding several different CCSs, such as the ISO 8859 and
the different EBCDICs.

Examples of variable-width CEFs include UTF-8 and UTF-16.
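As a concrete illustration of a variable-width CEF, here is a minimal
sketch (the function name is illustrative, not a Parrot API) of mapping
one Unicode code point to a sequence of 8-bit UTF-8 code units:

```c
#include <stddef.h>

/* Encode cp into buf (room for at least 4 bytes); return the number
 * of code units written, or 0 for surrogates or code points beyond
 * U+10FFFF, which are illegal. */
size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                                 /* surrogates are illegal */
    if (cp < 0x80) {                              /* 1 code unit */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                             /* 2 code units */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                           /* 3 code units */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                         /* 4 code units */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Note that U+0041 takes one code unit but U+20AC takes three: the
variable width is visible even for very common characters.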

=item I<CES>

I<Character Encoding Scheme> - a mapping from a set of sequences
of code units (from one or more CEFs) to a serialised sequence
of bytes.  Serialisation is important for transmission and storage.

It's important not to confuse a CEF and a CES.  A CEF maps code points
to code units (numbers) but a CES maps code units to serialised
sequences of bytes.  Therefore a CES must take into account
byteordering.  A CES may also include more than one CCS, and it may
include non-textual data such as shift sequences.

Examples of CESs include UTF-8 (yes, UTF-8 is both a CEF and a CES),
UTF-16BE, and UTF-16LE from Unicode, and the Asian ISO 2022 and EUC.
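The CEF/CES distinction can be made concrete in code: the CES step for
UTF-16 is where byte ordering enters the picture.  A sketch, with
illustrative names:

```c
#include <stddef.h>
#include <stdint.h>

/* Serialise n 16-bit code units into 2*n bytes, big-endian (UTF-16BE). */
void utf16_serialize_be(const uint16_t *units, size_t n, unsigned char *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[2 * i]     = (unsigned char)(units[i] >> 8);   /* high byte first */
        out[2 * i + 1] = (unsigned char)(units[i] & 0xFF);
    }
}

/* The same code units, little-endian (UTF-16LE): same CEF, different CES. */
void utf16_serialize_le(const uint16_t *units, size_t n, unsigned char *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[2 * i]     = (unsigned char)(units[i] & 0xFF); /* low byte first */
        out[2 * i + 1] = (unsigned char)(units[i] >> 8);
    }
}
```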

=item I<TES>

I<Transfer Encoding Syntax> - a reversible transform of encoded data,
which may or may not contain textual data.  These include various
Internet or other transmission/storage transformations to either
encapsulate/protect data for transmission or to compress it.

Examples include base64, uuencode, quoted-printable, zip, bzip2.

=back

In addition to the above: I<CM> - I<Character Map> - is a mapping that
combines an ACR, a CCS, a CEF, and a CES.  Character Maps are what in
the IAB architecture get the IANA charset identifiers.  TES is assumed
to lie beyond the Parrot core or I/O.

=head2 Parrot's Character Encoding Model

Parrot will use the Unicode character encoding model layers ACR, CCS,
CEF, and CES; that is, use the Unicode standard as an abstract character
repertoire, coded character set, character encoding form, and
character encoding scheme.

B<Any code points in Parrot will be Unicode code points.>

All methods of creating strings (in Perl: chr(), \xHH, \x{HH...},
\N{...}, pack) should create Unicode, and all methods of converting
strings to code points (in Perl: ord(), and unpack()), should generate
Unicode code points.

=head2 Of Bits and Bytes

There is no string type built out of native eight-bit bytes.

One may need such a data type, though, for representing bit strings
(in Perl: vec(), and the associated bitstring operators), and maybe
also for completely opaque binary objects ("no user serviceable parts
inside"), though especially for the latter some other data type than a
"string" may be preferable.  Because a bitstring has no ACR or CCS,
nor CES, it is just a CEF without any character coding.

Of course, traditionally a bit pattern like 01000001 has been
associated with, say, "A", but that is to be considered purely an
accident of the implementation.  It is suggested that for Parrot the bitstring data
type is completely dissociated from the string type, and trying to mix
the two should be considered an error (except maybe for data-type
crossing interfaces, like Perl's pack() and unpack()).  Bitstrings
have a rich array of their own operations, and it is also suggested
that the "libc" equivalent of Parrot contain efficient implementations
of these bitstring operations.

=head1 IMPLEMENTATION

=head2 Internal Storage

The internal storage of text is based on three layers of types.
The basic type is

        typedef enum {
            UNICODE_CEF_UTF_8,
            UNICODE_CEF_UTF_16,
            UNICODE_CEF_UTF_32,
        } PARROT_CEF;

        typedef struct {
             PARROT_CEF                 CEF;
             UINTVAL                    size;
             char                       data[1];
        } PARROT_STRING;

The C<CEF> defines the character encoding form being used.
The C<size> tells how many bytes are contained in the C<data>.
(The data[1] hack is used to allow for a single-pass allocation
of PARROT_STRINGs.)
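A sketch of the single-pass allocation that the data[1] hack enables:
header and payload live in one malloc() block.  The typedefs are
repeated so the sketch stands alone, and parrot_string_new() is an
illustrative name, not a Parrot API:

```c
#include <stdlib.h>
#include <string.h>

typedef unsigned long UINTVAL;

typedef enum {
    UNICODE_CEF_UTF_8,
    UNICODE_CEF_UTF_16,
    UNICODE_CEF_UTF_32
} PARROT_CEF;

typedef struct {
    PARROT_CEF CEF;
    UINTVAL    size;
    char       data[1];   /* payload continues past the struct */
} PARROT_STRING;

PARROT_STRING *parrot_string_new(PARROT_CEF cef, const char *bytes, UINTVAL size)
{
    /* sizeof(PARROT_STRING) already accounts for one data byte. */
    PARROT_STRING *s = malloc(sizeof(PARROT_STRING) + (size ? size - 1 : 0));
    if (s == NULL)
        return NULL;
    s->CEF  = cef;
    s->size = size;
    memcpy(s->data, bytes, size);
    return s;
}
```

One malloc() and one free() per string; the data needs no separate
pointer chase.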

B<The preferred CEF for Parrot is UTF-16.>

Though UTF-8 is more space-saving (for code points up to U+07FF)
(and is used in many implementations, like Perl 5), it introduces
considerable code complexity if used as the primary internal
storage format.

Fast low-level transforms between all the CEFs are required.

Note, however, that UTF-16 is still a variable-width encoding,
and relatively complex at that, thanks to the "surrogate discontinuity"
at U+D800..U+DFFF (which enables UTF-16 to reach the
U+10000..U+10FFFF code point space required by Unicode 3.1 and
later).  For implementation efficiency, it may be useful to split
UTF-16 into two subclasses

        UNICODE_CEF_UCS_2
        UNICODE_CEF_UTF_16_SURROGATE

where the UCS-2 would be 2-byte "simple" UTF-16 without surrogates
and the UTF-16 surrogate would potentially contain "complex" 4-byte
surrogate pairs encoding code points beyond U+FFFF.  This, of course,
is premature optimisation, and let's not enter that particular evil
before performance testing strongly suggests we have to.

One possibility would be to use UTF-32 which would be fixed-width and
without code range complexities, but the wasted space might be too
much to bear in normal use.  (However, it would be interesting to have
all the internal string functions implemented also in UTF-32 so that
one could first do space/time benchmarking, and then possibly use
UTF-32 anyway if that turns out to be considerably faster, and one has
the memory to spare.)

The PARROT_STRING as such doesn't know how many characters there are
in the string contained in the C<data>.  For that the following type
is used to wrap the PARROT_STRING:

        typedef struct {
             INTVAL                     size;
             INTVAL                     index;
             INTVAL                     index_offset;
             INTVAL                     last_offset;
             INTVAL                     size_valid:1;
             INTVAL                     index_valid:1;
             INTVAL                     last_valid:1;
             PARROT_STRING              string;
        } PARROT_SIZED_STRING;

Whenever some computation determines the number of characters in
C<data>, it must update the C<size> field and set the C<size_valid>
boolean.  Whenever some computation changes the C<data> in a way that
changes the number of characters, the computation must also clear the
C<size_valid> boolean.

Whenever some computation determines the first byte of the last
character in C<data>, it must update the C<last_offset> to point to
that first byte, and set the C<last_valid> boolean.  Whenever some
computation invalidates the C<last_offset>, the computation must also
clear the C<last_valid> boolean.

Whenever some computation determines the first byte of the C<index>-th
character in C<data>, it must update the C<index_offset> to point to
that first byte, and set the C<index_valid> boolean.  Whenever some
computation invalidates the C<index_offset> (changes the length of a
subsequence of C<data> before the C<index_offset>), the computation
must also clear the C<index_valid> boolean.  This is intended to
serve as a very light-weight cache of the character offset.

The C<index_offset> and C<last_offset> are offsets by design,
as opposed to pointers.  This way one cannot accidentally mix
pointers from different allocations; all operations are to be
relative to the C<data>.

The third and final layer of the Parrot string is the localised string.
Many string operations are dependent on a I<locale>, a collection of
language/country/etc dependent behaviours.  If the C<locale_vtable>
is NULL, the default (locale-independent) behaviours are used.

How locale names are mapped into customised vtable operations is
outside the scope of Parrot.  However, B<it is strongly suggested that
any locale implementations recognise the basic ll_CC syntax>, where
I<ll> is an ISO 639 language code, and I<CC> is an ISO 3166 country
code.  But the actual set of behaviours behind, say, C<de_DE>,
C<fr_FR>, or C<fr_CA> has unfortunately not been standardised.  For
the locale data (and therefore behaviours) the use of something widely
available such as the IBM ICU is suggested, see
http://oss.software.ibm.com/icu/

Note that the legacy 8-bit locale implementation supplied by many
systems is basically unusable with Unicode because there are no
portable ways to map between the 8-bit locales and Unicode, and
because the 8-bit legacy locale implementations are not portable
even between themselves.  All the solutions one can possibly
come up with are vendor-dependent.

The contents of the C<locale_vtable> are discussed under L</Operations>.

The C<normalized_string> and C<collated_string> refer to the
Unicode-normalized version of the string and to the Unicode collation
I<sort key>, and they are discussed under L</Normalization> and
L</Collation>.

        typedef struct {
             PARROT_STRING_LOCALE_VTABLE*       locale_vtable;
             PARROT_NORMALIZED_STRING*          normalized_string;
             PARROT_COLLATED_STRING*            collated_string;
             PARROT_SIZED_STRING                sized_string;
        } PARROT_LOCALIZED_STRING;

        typedef PARROT_LOCALIZED_STRING parrot_string;

The C<sized_string> in the above is known as the I<primary string>,
as opposed to the secondary strings, the normalized string and
collated string, which are transformed versions of the primary string
needed only for certain operations.  All string updates should modify
the primary string and invalidate the secondary strings.

Note that there are further candidates for transformed versions of the
primary string, namely the case-changed versions of the string: upper,
lower, title, and fold.  Those would work fine for casing functions,
(in Perl: uc(), lc(), ucfirst(), lcfirst()).  However, for use in
case-ignoring pattern matching simply caching the case-changed string
would not be enough: since in Unicode the casing operations can
transform one character to many, a way to map byte offsets pointing to
the first bytes of characters in either the primary or the case-changed
version to byte offsets in the other string would be needed.
Additional twists are that the "fold" casing is locale-dependent
(certain characters behave differently under Turkish and Azeri), and
in one particular case position-dependent (how the Greek sigma character
behaves depends on its surrounding characters).

=head2 Operations

The operations are assumed to be vtable operations, so the parent
"object" (the string being operated on) is never shown explicitly.

Because all the three CEFs (UTF-8, UTF-16, UTF-32) are valid as
low-level encoding of strings, most, if not all, of these operations
need to be coded in triplicate, one for each CEF (and in quadruplicate,
if the implementation of UTF-16 is split into UCS-2 and "true" UTF-16).

Some error codes:

        typedef enum {
            UNICODE_ILLEGAL_CODE_POINT  = -1,
            BEFORE_STRING_BEGINNING     = -2,
            AFTER_STRING_END            = -3,
            NO_MATCH                    = -4,
        } PARROT_STRING_RETURN_CODE;

B<NOTE THAT GREAT CARE MUST BE EXERCISED TO AUDIT THE UNICODE
IMPLEMENTATION FOR SECURITY:> against accepting illegal input,
against buffer overflows, and against generating illegal byte
sequences or code points.  Do pass in the maximum extent of a byte
sequence: just passing in the pointer to the sequence is bad.
Do preserve enough space for the result: for example the case
transformations may require up to I<three> times more space than
the original.  Do not expect to be able to pass in a code point
and return a code point: many Unicode operations are 1:N or N:1.
Do not under any circumstances generate illegal UTF-8
(such as surrogates, or 6-byte sequences for Plane 1 characters,
you should be generating 4-byte sequences for those);
warn about receiving illegal UTF-8: illegal UTF-8 byte sequences
are indicative of and conducive to buffer overflows. See
http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html 
and http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt.

=over 4

=item code_point

        INTVAL code_point(INTVAL offset);

Given a byte offset to the first byte of a character, return the code
point of the character at that offset (a CEF -> CCS conversion).  If
the byte sequence contains an illegal code point (such as a surrogate
in UTF-8), UNICODE_ILLEGAL_CODE_POINT will be returned.
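A sketch of what the UTF-8 flavour of this operation might look like.
It takes the maximum extent of the buffer, as the security notes above
demand, and rejects overlong forms and surrogates; the function name
and signature are illustrative, not the actual vtable slot:

```c
#include <stddef.h>

#define UNICODE_ILLEGAL_CODE_POINT (-1)  /* mirrors the error enum above */

/* Decode one UTF-8 character starting at p, reading at most max bytes.
 * Return the code point, or UNICODE_ILLEGAL_CODE_POINT. */
long utf8_code_point(const unsigned char *p, size_t max)
{
    long   cp;
    size_t need, i;

    if (max == 0)
        return UNICODE_ILLEGAL_CODE_POINT;
    if (p[0] < 0x80)                              /* ASCII fast path */
        return p[0];
    else if (p[0] < 0xC2)                         /* continuation/overlong lead */
        return UNICODE_ILLEGAL_CODE_POINT;
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; need = 1; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; need = 2; }
    else if (p[0] < 0xF5) { cp = p[0] & 0x07; need = 3; }
    else
        return UNICODE_ILLEGAL_CODE_POINT;

    if (max < need + 1)                           /* would run off the buffer */
        return UNICODE_ILLEGAL_CODE_POINT;
    for (i = 1; i <= need; i++) {
        if ((p[i] & 0xC0) != 0x80)                /* not a continuation byte */
            return UNICODE_ILLEGAL_CODE_POINT;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    /* reject overlong forms, surrogates, and out-of-range code points */
    if ((need == 2 && cp < 0x800) || (need == 3 && cp < 0x10000))
        return UNICODE_ILLEGAL_CODE_POINT;
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return UNICODE_ILLEGAL_CODE_POINT;
    return cp;
}
```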

=item next_character

        INTVAL next_character(INTVAL offset);

Given a byte offset to the first byte of a character, return the
offset to the first byte of the next character.  If at the first byte
of the last character, AFTER_STRING_END will be returned.
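For UTF-8 this operation reduces to stepping past the lead byte and
any continuation bytes (of the form 10xxxxxx).  A sketch, with an
illustrative name and a simplified return convention (the buffer size
stands in for AFTER_STRING_END handling):

```c
#include <stddef.h>

/* Given the offset of the first byte of a UTF-8 character, return the
 * offset of the first byte of the next character; returns size when
 * stepping past the last character. */
size_t utf8_next(const unsigned char *data, size_t size, size_t offset)
{
    offset++;                                       /* past the lead byte */
    while (offset < size && (data[offset] & 0xC0) == 0x80)
        offset++;                                   /* past continuations */
    return offset;
}
```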

=item previous_character

        INTVAL previous_character(INTVAL offset);

Given a byte offset to the first byte of a character, return the
offset to the first byte of the previous character.  If at the first
byte of the first character, BEFORE_STRING_BEGINNING will be returned.

=item skip_characters

        INTVAL skip_characters(INTVAL offset, INTVAL skip);

Given a byte offset to the first byte of a character, return the
offset to the first byte of the character skip characters before or
after the first character.  If the skip would go beyond the beginning
or the end of the string, BEFORE_STRING_BEGINNING or AFTER_STRING_END
will be returned.

=item distance_in_characters

        INTVAL distance_in_characters(INTVAL offset_a, INTVAL offset_b);

Given two byte offsets pointing to the first bytes of two characters,
return the distance in characters between the two characters.
If either of the offsets point outside the string,
BEFORE_STRING_BEGINNING or AFTER_STRING_END will be returned.

=item index_of_offset

        INTVAL index_of_offset(INTVAL offset);

Given a byte offset to the first byte of a character, return the index
of the character in characters from the beginning of the string.
If the character would be beyond the end of the string,
AFTER_STRING_END will be returned.

=item offset_of_index

        INTVAL offset_of_index(INTVAL index);

Given a character index, return the byte offset to the C<index>-th
character.

If the character would be beyond the end of the string, AFTER_STRING_END
will be returned.

Index wrapping from the end of the string (so that for example -2
would be the second to last character) is suggested to be delegated to
higher level APIs.

=item get_substring

        INTVAL get_substring(INTVAL index, INTVAL size, parrot_string *dststring);

Given a character index, a character count (size), and a destination
string, set the destination string's data to point to the data of the
parent string, its offset to the offset of the C<index>-th character,
and its size to C<size>.  To really make a copy, you need to use
copy_string() on the destination string.

If index would be beyond the end of the string, 0 (zero) will be
returned.  Otherwise, 1 (one) will be returned.

Index wrapping from the end of the string (such as -2 meaning the
second to last character), or size truncation at the end of the
string, are suggested to be delegated to higher level APIs.

=item set_substring

        INTVAL set_substring(INTVAL index, INTVAL size, parrot_string *srcstring);

Given a character index, a character count (size), and a source
string, replace in the parent string the subsequence of C<size>
characters starting at the C<index>-th character with the data of the
source string (releasing any possible old destination data), and set
the size of the parent string to the updated length.

If the index or the end of the updated data would be beyond the end of
the string, 0 (zero) will be returned.  Otherwise, 1 (one) will be
returned.

Index wrapping from the end of the string (such as -2 meaning the
second to last character), or size truncation at the end of the
string, or string extension (probably using insert_characters()),
are suggested to be delegated to higher level APIs.

=item copy_string

        void* copy_string()

Creates a new copy of the data of the string and sets the data
of the string to point to that new copy.  B<Does not release
the old data, that is left for higher level APIs.>  Returns the
pointer to the new data, or the NULL pointer in case of a failure.

=item insert_characters

        INTVAL insert_characters(INTVAL index, INTVAL character, INTVAL size);

Insert C<size> copies of the character C<character> into the string
at the character index C<index>.

If the index or the end of the updated data would be beyond the end of
the string, 0 (zero) will be returned.  Otherwise, 1 (one) will be
returned.

Index wrapping from the end of the string (such as -2 meaning the
second to last character), or size truncation at the end of the
string, are suggested to be delegated to higher level APIs.

This is one of the only two APIs that change the length of their
parent string; the other one is delete_characters().

=item delete_characters

        INTVAL delete_characters(INTVAL index, INTVAL size);

Delete C<size> characters from the string at the character index
C<index>.

If the index or the end of the updated data would be beyond the end of
the string, 0 (zero) will be returned.  Otherwise, 1 (one) will be
returned.

Index wrapping from the end of the string (such as -2 meaning the
second to last character), or size truncation at the end of the
string, are suggested to be delegated to higher level APIs.

This is one of the only two APIs that change the length of their
parent string; the other one is insert_characters().

=item replace_characters

        INTVAL replace_characters(INTVAL index, INTVAL character, INTVAL size);

Replace C<size> characters in the string at the character index
C<index> with C<size> copies of the character C<character>.

If the index or the end of the updated data would be beyond the end of
the string, 0 (zero) will be returned.  Otherwise, 1 (one) will be
returned.

Index wrapping from the end of the string (such as -2 meaning the
second to last character), or size truncation at the end of the
string, or string extension (probably using insert_characters()),
are suggested to be delegated to higher level APIs.

=item find_substring

        INTVAL find_substring(parrot_string *substring, INTVAL index, INTVAL direction);

Given a character index into the string, find the abs(direction)-th
occurrence of the substring after (if direction > 0) or before (if
direction < 0) the C<index>-th character.  If a match is found, the
offset to the match is returned, NO_MATCH otherwise.

If the C<direction> is zero, the substring is compared against the
parent string, at the C<index>th character.  If they match, the offset
to the match is returned, NO_MATCH otherwise.

=item to_uppercase

        INTVAL to_uppercase(INTVAL code_point, parrot_string *upperstring);

Given a code point, the uppercase string (if not NULL) is updated to
contain the Unicode uppercase version of the code point as described
in the Unicode Technical Report #21, I<Case Mappings>,
http://www.unicode.org/unicode/reports/tr21/ .  If there is an
uppercase mapping for the code point, 1 (one) is returned,
0 (zero) otherwise.

Note that simple one code point to another code point mapping would
not be enough since case mapping is much more complex than that.  The
same applies to to_lowercase(), to_titlecase(), and to_foldcase().

=item to_lowercase

        INTVAL to_lowercase(INTVAL code_point, parrot_string *lowerstring);

Given a code point, the lowercase string is updated to contain the
Unicode lowercase version of the code point as described in the
Unicode TR #21.  If there is a lowercase mapping for the code point,
1 (one) is returned, 0 (zero) otherwise.

=item to_titlecase

        INTVAL to_titlecase(INTVAL code_point, parrot_string *titlestring);

Given a code point, the titlecase string is updated to contain the
Unicode titlecase version of the code point as described in the
Unicode TR #21.  If there is a titlecase mapping for the code point,
1 (one) is returned, 0 (zero) otherwise.

=item to_foldcase

        INTVAL to_foldcase(INTVAL code_point, parrot_string *foldstring);

Given a code point, the foldcase string is updated to contain the
Unicode case folding version of the code point as described in the
Unicode TR #21.  If there is a case folding mapping for the code
point, 1 (one) is returned, 0 (zero) otherwise.

B<Note that case folding is the version of the text to use for
case-ignoring matching, not for lowercasing.>  (Unicode calls 
case-ignoring I<loose matching>.) (In Perl: foldcase is used
for C<//i>, but lowercase for C<lc()>.)  Foldcase and lowercase
are very similar transformations, but they are not identical.
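The 1:N nature of folding can be illustrated with a deliberately toy
function (hard-coded, NOT the real Unicode tables): U+00DF LATIN SMALL
LETTER SHARP S folds to the two-character sequence "ss", so a
code-point-to-code-point API cannot express it.

```c
#include <stddef.h>

/* Fold cp into out[] (room for at least 2 code points); return the
 * number of code points produced.  Only ASCII and U+00DF are handled
 * in this illustration. */
size_t toy_foldcase(unsigned long cp, unsigned long *out)
{
    if (cp == 0xDF) {              /* sharp s: one character becomes two */
        out[0] = 's';
        out[1] = 's';
        return 2;
    }
    if (cp >= 'A' && cp <= 'Z') {  /* simple 1:1 fold for ASCII */
        out[0] = cp + 0x20;
        return 1;
    }
    out[0] = cp;                   /* everything else folds to itself here */
    return 1;
}
```

This is why the to_*case() APIs above return a string, not a code point.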

=item get_property

        void* get_property(INTVAL code_point, INTVAL table, INTVAL property);

Given a code point, a Unicode data table, and a property,
return a pointer to the value of the property.  For example:

        /* get the general category (such as "Lu" for uppercase letter) */
        get_property(code_point, UNICODE_DATA, UNICODE_GENERAL_CATEGORY);

        /* get the script (such as "Greek") */
        get_property(code_point, UNICODE_SCRIPTS,     UNICODE_SCRIPT);

        /* get the uppercase */
        get_property(code_point, UNICODE_DATA,        UNICODE_UPPERCASE);

        /* get the foldcase */
        get_property(code_point, UNICODE_CASEFOLDING, UNICODE_FOLDCASE);

        typedef enum {
            GET_PROPERTY_ERROR_INTERNAL =  0,
            GET_PROPERTY_ERROR_TABLE    = -1,
            GET_PROPERTY_ERROR_PROPERTY = -2
        } PARROT_STRING_GET_PROPERTY_ERROR;

=over 8

=item *

In case of an internal error, (void*)GET_PROPERTY_ERROR_INTERNAL
is returned.

=item *

If there is no said table, (void*)GET_PROPERTY_ERROR_TABLE
is returned.

=item *

If there is no said property, (void*)GET_PROPERTY_ERROR_PROPERTY
is returned.

=item *

In all other cases (such as the name of the script at the code point,
or the lowercase version of a code point), a pointer to the value is
returned.  Such values are read-only so if you want to modify them you
must explicitly copy them.  If the value is a Unicode string, a pointer
to a PARROT_SIZED_STRING is returned.

=back

The loose matching of property names (for example as described in
the Unicode Technical Report #18, I<Unicode Regular Expression
Guidelines>, http://www.unicode.org/unicode/reports/tr18/ ),
is suggested to be delegated to higher level APIs.

=back

=head2 Normalization

Normalization is a process where equivalent forms of characters are
first decomposed and then composed to have equivalent binary
representations (since a single character may be represented in
several different ways).  The normalized forms are useful for
string equivalence comparisons, but one probably cannot automatically
convert everything to a canonical normalized form because then the
I/O equivalence (you read in data and write out data: are the data
identical?) could easily be broken.

        typedef enum {
            UNICODE_NFD,
            UNICODE_NFC,
            UNICODE_NFKD,
            UNICODE_NFKC
        } UNICODE_NORMALIZATION;

        typedef struct {
            UNICODE_NORMALIZATION       normalization_form;
            PARROT_SIZED_STRING         sized_string;
        } PARROT_NORMALIZED_STRING;

B<The preferred normalization form of Parrot is the NFKC.>

NFKC stands for Normalization Form KC: Compatibility Decomposition,
followed by Canonical Composition.  Fast low-level transforms between
the normalization forms are required.

Note that any string operations should primarily update the primary
C<sized_string> of a C<parrot_string> and invalidate the C<normalized_string>
by first releasing the storage for the normalized string and then setting
the C<normalized_string> pointer in the primary string to a NULL pointer.
When the normalized version of the string is called for, the normalized
version will be computed and cached on-demand.

For more information see the Unicode Technical Report #15,
I<Unicode Normalization Forms>,
http://www.unicode.org/unicode/reports/tr15/

=head2 Collation

In Unicode collation (sorting) the following type is used:

        typedef struct {
            INTVAL* (*collation_element)(INTVAL code_point);
            INTVAL  size;
            INTVAL* sort_key;
        } PARROT_COLLATED_STRING;

The C<collation_element> function pointer is used to return (at least)
three-element weight vectors for code points, and those weights are
then used to compute the C<sort_key>, which is an array of unsigned
integers.  The sort keys can then be binary-compared to produce the
correct comparison.  The three (or more) levels of weighting can be
customised.
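Once the sort keys exist, the comparison itself is a plain element-wise
compare of the weight arrays, with the shorter key winning ties.  A
sketch, using unsigned int for the weights (an assumption; the struct
above uses INTVAL):

```c
#include <stddef.h>

/* Binary-compare two sort keys; returns <0, 0, or >0 like strcmp(). */
int sort_key_compare(const unsigned int *a, size_t na,
                     const unsigned int *b, size_t nb)
{
    size_t i, n = na < nb ? na : nb;
    for (i = 0; i < n; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    /* equal prefixes: the shorter key sorts first */
    return na == nb ? 0 : (na < nb ? -1 : 1);
}
```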

For proper implementation of Unicode collation, Unicode normalization
(decomposition to canonical mappings) is required.

For more information see the Unicode Technical Report #10,
I<Unicode Collation Algorithm>,
http://www.unicode.org/unicode/reports/tr10/

Similarly to normalization, any string operations should primarily
update the C<sized_string> of a C<parrot_string> and invalidate the
C<collated_string> by first releasing the storage for the sort key and
then setting the C<collated_string> pointer in the primary string to
a NULL pointer.  When the collated version of the string is called
for, the sort key will be computed and cached on-demand.

=head1 OUTSIDE WORLD

Any other character repertoires than Unicode or any other Unicode
encodings than UTF-8, UTF-16, and UTF-32 must be converted to Parrot
strings.  This "must" includes even the common US-ASCII and ISO 8859-1
(Latin-1).  For true 7-bit US-ASCII the task is luckily easy because
the 128 characters of US-ASCII are identical to the UTF-8 encoded
Unicode.  But for any eight-bit CES such as the ISO 8859, the Windows
code pages, KOI8-R, or EBCDIC, a conversion is necessary.

Note that UTF-EBCDIC is a CES, not a CEF, and therefore UTF-EBCDIC
should not be used for internal storage, only for external storage
and I/O.  The main reason for UTF-EBCDIC is for easy interchange
of Unicode data between EBCDIC systems, for more information see
the Unicode Technical Report #16, I<UTF-EBCDIC>,
http://www.unicode.org/unicode/reports/tr16/

The method of conversion to and from Parrot strings is expected
to be outside the Parrot core, in the source code parsing and
in the I/O subsystem.

One notable thing is that if using UTF-16 or UTF-32 the I/O subsystem
should automatically either recognise (if reading) or generate (if
writing) the appropriate BOMs (byte order markers) and then write the
stream appropriately either in BE or LE (big endian or little endian).

For UTF-8 the use of the BOM is optional but recommended.
(--can't find anything for or against this on the Unicode web site
   but at least having a BOM would be consistent--)

=head1 TO DO

Not much thought has been given to string searching or pattern
matching interfaces.  Boyer-Moore will work just fine with UTF-8
since no false matches can happen thanks to the design of UTF-8.

About the implementation of character classes: since the Unicode code
point range is big, a single big bitmap won't work any more: firstly,
it would be big.  Secondly, for most cases, it would be wastefully
sparse.  A balanced binary tree of (begin, end) points of ranges is
suggested.  That would seem to give the required flexibility and
reasonable compromise between speed and space for implementing the
operations required by both the traditional regular expression
character classes (complement, case-ignorance) and the new Unicode
character class semantics (difference, intersection) (see the Unicode
Technical Report #18, I<Unicode Regular Expression Guidelines>,
http://www.unicode.org/unicode/reports/tr18/ )

Another, possibly simpler, way would be to use inversion lists:
one-dimensional arrays where even indices (0, 2, ...) store the
beginnings of ranges belonging to the class, and odd indices store
the beginnings of ranges not belonging to the class.
Note "array" instead of (a linked) "list": with an array one can do
binary search to determine membership (so an inversion list is in
effect a flattened binary tree).  Yet another way would be to use
various two-level table schemes.  The choice of the appropriate data
structure, as always, depends on the expected operational (read vs
modify) mix and the expected data distribution.
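The inversion-list membership test sketched above amounts to a binary
search plus a parity check, assuming the convention that ranges[0]
begins the first in-class range:

```c
#include <stddef.h>

/* ranges[] holds range boundaries in ascending order; even indices
 * begin in-class runs, odd indices begin out-of-class runs.  Return 1
 * if cp is a member of the class, 0 otherwise. */
int inversion_list_member(const unsigned long *ranges, size_t n,
                          unsigned long cp)
{
    size_t lo = 0, hi = n;        /* find the first boundary > cp */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ranges[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* an odd number of boundaries at or below cp means cp is inside */
    return (lo & 1) != 0;
}
```

For example the class [A-Za-z] is just the four boundaries
{0x41, 0x5B, 0x61, 0x7B}, a far cry from a Unicode-sized bitmap.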

Especially little thought has been given to the new scheme of having
regular expression opcodes as part of the core opcodes.  The
next_character() and previous_character() are possibly of some use
(note that a general random access naturally won't work in variable
length data like UTF-8).  Some additional notes: beware of the 1:N and
N:1 effects of upper/lowercasing and case folding: any assumptions/optimisations
concerning the expected (minimum) length of matches, and even the
apparent length of the target string get more complicated.  A single
character in a literal or a character class may match more than one
target character; and vice versa.  Look-ahead and look-behind need
also to consider this variability.

If IBM ICU is picked as the low-low level implementation of Unicode
(and as the source of locale data), this whole document needs to be
sanity checked against the ICU architectures and APIs.

Not much thought has been given to the interplay of encodings and I/O,
converting encodings on the fly, or the related newline conventions,
those issues have been delegated to the source code parsing and I/O
subsystem discussions.  One notable thing about these, though, is that
there should be at least two levels of "name resolution".  The lower
level should accept only one form of a name, such as "ISO 8859-1", but
the higher level should accept looser matching, such as "iso8859_1",
or "Latin 1", or "latin-1", and so forth.  Some encodings like the
CJKV might require extra hooks for automagic detection of the encoding.

No thought has been given to the effects of either reference counting
or garbage collection.

Not much thought has been given to multithreaded operation where
Parrot text might be shared between different threads.  The design
needs to be validated from that viewpoint.  The C<index_offset> of
PARROT_SIZED_STRING is the most likely victim.

=head1 FUTURE

It is important to remember that Unicode is an evolving standard:
new characters get added through a formal process, errors in character
properties get fixed, and so forth.  Therefore it is of utmost
importance to track the development of the Unicode process and keep
the Parrot implementation of Unicode current.
