This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE 

Internal String Storage to be Opaque

=head1 VERSION 

        Maintainer: Simon Cozens <[EMAIL PROTECTED]>
        Date: Aug 18 2000 
        Version: 1 
        Mailing List: [EMAIL PROTECTED]
        Number: 131

=head1 ABSTRACT 

Perl 5.6.0 tried to mix UTF8 and non-UTF8 strings inside SVs, meaning
that every time you had to do something with a string, you had to check
whether it was UTF8 or not, and if so probably do something slightly
different to it instead. A single internal encoding which is opaque to
the programmer Would Be Nice.

=head1 DISCUSSION

=over 3 

=item Why a single internal encoding?

Let's imagine what happens if we have a string encoded in UTF8 and a
string encoded in UTF16 and we need to concatenate them. What on earth
do you do here? A byte-by-byte concatenation with no encoding change is
B<probably> the wrong thing, but it might not be. So, what happens at
present is that one or the other has to get recoded. You can either
change the original string to the new encoding (which makes you wonder
why you bothered having separate encodings anyway) or you can take a
copy and convert that. (which is the same, only more expensive)

A single encoding gets rid of all this mess.

=item Won't it mess up binary-level manipulation?

'Course not. This is the internal representation we're talking about.
It's only a convenience to the core, and doesn't relate specifically to
what's coming into or going out of Perl from the user's point of view.

=item Oh, yuck, doesn't that mean a lot of costly conversions?

When data enters and leaves the core? Hmm, yes, I suppose it does. But
that's life. Sorry.

=item Huh, OK. So what's the internal representation going to be?

That's the thing. It doesn't matter. It B<shouldn't> matter. Keep it
pluggable; you could have everything in Latin1, in UTF8, in UTF16, or
who knows what, but the core developer shouldn't have to care. One good
way to achieve this is to have the string presented in the variable as
an array of wchars or similar. The point is that it's the same for
everything, so comparisons don't suck.

=item Won't this lead to incompatibilities between builds?

Maybe. But would it matter? The end user certainly doesn't care what
internal representation is used, and any module or program authors that
depend on that can be found guilty of "unwarranted chumminess". The only
difference would be between builds that use Unicode and those that
don't.

=item So I bet you're going to insist that everyone uses Unicode?

You got it. But of course, you get to choose your UTF. For what it's
worth.

=back

=head1 REFERENCES 

http://perlhacker.org/articles/unihandle.html


Reply via email to