String Theory

Rod Adams Sat, 19 Mar 2005 15:07:57 -0800


I propose that we make a few decisions about strings in Perl. I've read
all the synopses, several list threads on the topic, and a few web
guides to Unicode. I've also thought a lot about how to cleanly define
all the string related functions that we expect Perl to have in the face
of all this expanded Unicode support.

What I've come up with is that we need a rule that says:

A single string value has a single encoding and a single Unicode Level
associated with it, and you can only talk to that value on its own
terms. These will be the properties "encoding" and "level".

However, it should be easy to coerce that string into something that
behaves some other way.

To accomplish this, I'm hijacking the C<as> method away from its job as
the Perl 5 C<sprintf> replacement (that one can be renamed C<to>, and I
plan to do more with it at some later point), and making it a
general-purpose coercion method.
The general form of this will be something like:

  multi method as ($self : ?Class $to = $self.meta.name, *%options)

The purpose of C<as> is to create a "view" of the invocant in some other
form. Where possible, it will return an lvalue that allows one to alter
the original invocant as if it were a C<$to>.

This makes several things easy.

 my Str $x = 'Just Another Perl Hacker' but utf8;
 my @x := $x.as(Array of uint8);
 say "@x.pop() @x.pop()";
 say $x;

Generates:

 114 101
 Just Another Perl Hack
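
To make the "view" idea concrete outside of Perl 6, here is a minimal
Python sketch of the same behavior. The class Utf8String and its
as_bytes method are hypothetical names invented for this illustration;
they are not part of any proposal. The only point is that the view
shares storage with the string, so popping from the view shortens the
string.

```python
class Utf8String:
    """Hypothetical stand-in for a Str tagged 'but utf8'."""

    def __init__(self, text: str) -> None:
        # Store the string as its UTF-8 bytes.
        self._buf = bytearray(text.encode("utf-8"))

    def as_bytes(self) -> bytearray:
        # Return the underlying buffer itself, not a copy, so that
        # changes made through the view alter the original string.
        return self._buf

    def __str__(self) -> str:
        return self._buf.decode("utf-8")


x = Utf8String("Just Another Perl Hacker")
view = x.as_bytes()
first = view.pop()    # byte value of 'r'
second = view.pop()   # byte value of 'e'
print(first, second)  # 114 101
print(x)              # Just Another Perl Hack
```

A real implementation would also have to decide what happens when a
byte-level edit leaves invalid UTF-8 behind; this sketch ignores that.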

To make things easier, I think we need new types qw/Grapheme CodePoint
LangChar/ that all C<does Character> (ick! someone come up with a
better name for this role), along with Byte. Character is a role,
not a class, so you can't go creating instances of it.

But we could write:

 my Str $x = 'Just Another Perl Hacker';
 my @x := $x.as(Array of Character);

And then C<@x.pop()> returns whichever of
Grapheme/CodePoint/LangChar/Byte $x thinks of itself in terms of.
In other words, it's C<chop>.


Since by default, C<as> assumes the invocant type, we can convert from
one string encoding/level to another with:

 $str.as(encoding => 'utf8', level => 'graph');

But we'll make it so that C<*%options> also handles known encodings and
levels as boolean named parameters, so

 $str.as:utf8:graph;

does the same thing: makes another Str with the same contents as $str,
only with utf8 encoding and grapheme character semantics.
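
As a rough model of how the two calling styles could be made
equivalent, here is a Python sketch. Everything in it is an assumption
made for illustration: the TaggedStr class, the method name as_, and
the lists of known encodings and levels (only 'utf8' and 'graph'
appear above). It tracks the two properties only and does no actual
transcoding.

```python
from dataclasses import dataclass, replace

# Assumed lists; only 'utf8' and 'graph' come from the proposal text.
KNOWN_ENCODINGS = {"utf8", "utf16", "latin1"}
KNOWN_LEVELS = {"byte", "code", "graph", "lang"}


@dataclass(frozen=True)
class TaggedStr:
    """Hypothetical string value carrying an encoding and a level."""
    text: str
    encoding: str = "utf8"
    level: str = "code"

    def as_(self, **options) -> "TaggedStr":
        # Explicit named options default to the invocant's own values.
        encoding = options.pop("encoding", self.encoding)
        level = options.pop("level", self.level)
        # Remaining options are boolean flags naming an encoding or level.
        for name, flag in options.items():
            if name in KNOWN_ENCODINGS:
                if flag:
                    encoding = name
            elif name in KNOWN_LEVELS:
                if flag:
                    level = name
            else:
                raise TypeError(f"unknown option: {name}")
        # Same contents, new properties.
        return replace(self, encoding=encoding, level=level)


s = TaggedStr("abc", encoding="latin1", level="code")
a = s.as_(encoding="utf8", level="graph")  # like .as(encoding => ..., level => ...)
b = s.as_(utf8=True, graph=True)           # like .as:utf8:graph
```

With no options at all, as_ returns an equal value, which matches the
idea that the default target is the invocant's own type.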


What does all this buy us? Well... for one thing it all disappears if
you want the default semantics of what you're working with.

Second, it lets a position within a string be thought of as a single
integer again. What that integer means is subject to the C<level> of
the string you're operating with.

We could probably even resurrect C<length> if we wanted to, so that
people who don't care about Unicode don't have to care. Those who do
care exactly which length they are getting can say
C<length $str.as:graph>.
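
To show why the level matters for C<length>, here is a small Python
sketch that measures one string three ways. The function name and the
level names 'byte' and 'code' are assumptions for this example ('graph'
is taken from above). The grapheme count is only an approximation: it
counts code points that are not combining marks, which is not the full
Unicode segmentation algorithm.

```python
import unicodedata


def length(text: str, level: str = "code") -> int:
    """Length of text at the given (assumed) level name."""
    if level == "byte":
        # Number of bytes in the UTF-8 encoding.
        return len(text.encode("utf-8"))
    if level == "code":
        # Python strings are sequences of code points.
        return len(text)
    if level == "graph":
        # Approximation: each non-combining code point starts a grapheme.
        return sum(1 for ch in text if unicodedata.combining(ch) == 0)
    raise ValueError(f"unknown level: {level}")


s = "cafe\u0301"           # 'e' followed by COMBINING ACUTE ACCENT
print(length(s, "byte"))   # 6
print(length(s, "code"))   # 5
print(length(s, "graph"))  # 4
```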

To the user, almost the entire string function library winds up looking
like it did in Perl 5.


Some side points:

It is an error to call functions like C<index> on strings of different
levels. Mixing strings of different encodings is not an error.
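
The C<index> rule could be modeled roughly as follows. This is a Python
sketch under stated assumptions: TaggedStr is a hypothetical class, and
raising TypeError is just one way to signal "an error". The offset it
returns is a Python code-point offset, so it is only meaningful for the
'code' level used in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedStr:
    """Hypothetical string value carrying an encoding and a level."""
    text: str
    encoding: str = "utf8"
    level: str = "code"

    def index(self, needle: "TaggedStr") -> int:
        # Levels must match; encodings are allowed to differ.
        if needle.level != self.level:
            raise TypeError(
                f"level mismatch: {self.level} vs {needle.level}")
        # Search by contents; returns -1 when the needle is absent.
        return self.text.find(needle.text)


hay = TaggedStr("Just Another Perl Hacker", encoding="utf8", level="code")

# Different encoding, same level: allowed.
pos = hay.index(TaggedStr("Perl", encoding="utf16", level="code"))

# Different level: rejected.
try:
    hay.index(TaggedStr("Perl", level="graph"))
    mismatch_rejected = False
except TypeError:
    mismatch_rejected = True
```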

level and encoding should default to whatever the source code was
written in, if known.

C<pack> and C<unpack> should be able to be replaced with C<as> views of
compact structs (see S09).

C<as> kills C<vec>. Or at least buries it very deeply, without oxygen.


Comments?

-- Rod Adams
