I propose that we make a few decisions about strings in Perl. I've read all the synopses, several list threads on the topic, and a few web guides to Unicode. I've also thought a lot about how to cleanly define all the string-related functions that we expect Perl to have in the face of all this expanded Unicode support.
What I've come up with is that we need a rule that says:
A single string value has a single encoding and a single Unicode Level associated with it, and you can only talk to that value on its own terms. These will be the properties "encoding" and "level".
However, it should be easy to coerce that string into something that behaves some other way.
To accomplish this, I'm hijacking the C<as> method away from its role as the Perl 5 C<sprintf> (that one can be renamed C<to>, and I plan to do more with it at some later point), and making it a general-purpose coercion method. The general form of this will be something like:
    multi method as ($self : ?Class $to = $self.meta.name, *%options)
The purpose of C<as> is to create a "view" of the invocant in some other form. Where possible, it will return an lvalue that allows one to alter the original invocant as if it were a C<$to>.
This makes several things easy.
    my Str $x = 'Just Another Perl Hacker' but utf8;
    my @x := $x.as(Array of uint8);
    say "@x.pop() @x.pop()";
    say $x;
Generates:
    114 101
    Just Another Perl Hack
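For anyone who wants to check those numbers today, the same effect can be roughly imitated in Python. This is only a sketch under assumptions: Python has no live byte "view" of a string, so the bytes are copied out, modified, and written back by hand, and the variable names are made up for illustration.

```python
# Sketch only: Python cannot alias a str as a byte array, so we copy the
# bytes out, mutate the copy, and decode it back into x by hand.
x = 'Just Another Perl Hacker'
view = bytearray(x.encode('utf-8'))   # stands in for $x.as(Array of uint8)

popped = [view.pop(), view.pop()]     # removes the last two bytes: 'r', then 'e'
x = view.decode('utf-8')              # manual write-back to the string

print(*popped)   # 114 101
print(x)         # Just Another Perl Hack
```

With a real C<as> view the write-back step would be implicit, which is the whole point of returning an lvalue.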
To make things easier, I think we need new types qw/Grapheme CodePoint LangChar/ that all C<does Character> (ick! someone come up with a better name for this role), along with Byte. Character is a role, not a class, so you can't go creating instances of it.
But we could write:
    my Str $x = 'Just Another Perl Hacker';
    my @x := $x.as(Array of Character);
And then C<@x.pop()> returns a Grapheme, CodePoint, LangChar, or Byte, whichever one $x thinks of itself in terms of. In other words, it's C<chop>.
Since C<as> defaults to the invocant's own type, we can convert from one string encoding/level to another with:
    $str.as(encoding => 'utf8', level => 'graph');
But we'll also have C<*%options> accept known encodings and levels as boolean named parameters, so
    $str.as:utf8:graph;
does the same thing: makes another Str with the same contents as $str, only with utf8 encoding and grapheme character semantics.
What does all this buy us? Well... for one thing it all disappears if you want the default semantics of what you're working with.
Second, it lets a position within a string be thought of as a single integer again. What that integer means depends on the C<level> of the string you're operating on.
We could probably even resurrect C<length> if we wanted to, so that people who don't care about Unicode don't have to care. Those who do care exactly which length they are getting can say C<length $str.as:graph>.
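To make "which length" concrete, here is a rough Python sketch of one string measured at three levels. The C<length> helper and its level names are hypothetical, and the grapheme count is a deliberate simplification (it only skips combining marks); real grapheme clustering has more rules than this.

```python
import unicodedata

def length(s, level='codepoint', encoding='utf-8'):
    """Hypothetical helper: the length of s at the given level."""
    if level == 'byte':
        return len(s.encode(encoding))
    if level == 'codepoint':
        return len(s)  # Python strings are sequences of code points
    if level == 'graph':
        # Approximation: count code points that are not combining marks.
        return sum(1 for ch in s if unicodedata.combining(ch) == 0)
    raise ValueError(f'unknown level: {level}')

s = 'cafe\u0301'  # "cafe" followed by a separate combining acute accent
print(length(s, 'byte'), length(s, 'codepoint'), length(s, 'graph'))  # 6 5 4
```

Someone who never touches combining characters sees the same number at the codepoint and grapheme levels, which is why a plain C<length> could be made to work for them.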
To the user, almost the entire string function library winds up looking like it did in Perl 5.
Some side points:
It is an error to do things like C<index> with strings of different levels, but not with strings of different encodings.
C<level> and C<encoding> should default to whatever the source code was written in, if known.
C<pack> and C<unpack> should be able to be replaced with C<as> views of compact structs (see S09).
C<as> kills C<vec>. Or at least buries it very deeply, without oxygen.
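The first side point (mixed levels are an error, mixed encodings are not) could behave something like the following Python sketch. The C<LeveledStr> class and its fields are invented for illustration, and positions here are always counted in code points, which a real implementation would vary by level.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LeveledStr:
    """Hypothetical string value carrying an encoding and a level."""
    text: str
    encoding: str = 'utf-8'
    level: str = 'codepoint'

    def index(self, needle):
        # Different encodings are fine; different levels are an error.
        if needle.level != self.level:
            raise TypeError(
                f'cannot index a {self.level} string with a {needle.level} string')
        return self.text.find(needle.text)

a = LeveledStr('Just Another Perl Hacker', encoding='utf-8', level='graph')
b = LeveledStr('Perl', encoding='utf-16', level='graph')
c = LeveledStr('Perl', encoding='utf-8', level='codepoint')

print(a.index(b))  # 13: encodings differ, levels match
# a.index(c) raises TypeError because the levels differ
```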
Comments?
-- Rod Adams