RFC 322 (v1) Abstract Internals String Interaction

Perl6 RFC Librarian Tue, 26 Sep 2000 22:34:56 -0700
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Abstract Internals String Interaction

=head1 VERSION

  Maintainer: Simon Cozens <[EMAIL PROTECTED]>
  Date: 26 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 322
  Version: 1
  Status: Developing

=head1 ABSTRACT

All accesses to strings and portions of strings inside the Perl 6 core
should be abstracted.

=head1 DESCRIPTION

Although RFC 294 states that data will be in UTF8 internally, there's
going to come a time when we'll need to move to UTF16 or other
encodings. If we start fiddling around "knowing" that we're dealing with
UTF8 data, we're going to make it hard on ourselves when it comes to
migrate.

But not only that, UTF8 is really rather gammy to deal with. It's a lot
nicer if we can just treat each string as an array of wide characters.

Take, for instance, the operation of changing a character U+2654 (WHITE
CHESS KING) into U+003F (QUESTION MARK) in the 3rd position of a 12
character string. Because UTF8 is a variable-width character set, we
have a big problem: U+2654 is 3 bytes wide, and U+003F is one byte, we
have to shift the rest of the string down two bytes, effectively having
to rewrite it. 

If we used a function or macro to change this single character, it would
also have to rewrite the whole string; using this several times would be
horrifically inefficient.

Even if we "pretend" that we're dealing with an array of wide chars, and
used macros or functions to allow us to keep up this pretence, we'd
still have to do that shifting around, which is nasty. The only way to
get it working without the inefficiency would be to B<really> deal with
an array of wide chars, and have functions to convert such an array from
and to our character set.

The obvious question would be, then, why bother with RFC 294 or UTF8 at
all? Why not just store these arrays along with our SVs? Several
reasons, notably:

=over 1

=item Memory overhead

Most of the world's data is still eight-bit; it's probably still even
seven bit. Storing it all as, effectively, raw Unicode is wasteful.

=item I/O Convenience

We can't get an array of wide chars from the user, but we can get UTF8
from them. It's also likely we'll want to output data in a UTF, at least
at times. 

=back

=head1 IMPLEMENTATION

The following interface is proposed:

    wchar_t *  Perl_utf_to_raw( U8* string,      STRLEN* len )
    U8 *       Perl_raw_to_utf( wchar_t* string, STRLEN* len )

All functions which manipulate strings should convert them to raw form
and operate on the array, converting it back when it's time to store it
in the string.

=head1 REFERENCES

RFC 294: Internally, data is stored as UTF8

As-yet-unsubmitted RFC : When UTF8 Leaks Out
RFC 322 (v1) Abstract Internals String Interaction

Reply via email to