RFC 300 (v1) C and C

Perl6 RFC Librarian Mon, 25 Sep 2000 13:12:40 -0700
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

C<use unicode::representation> and C<no unicode>

=head1 VERSION

  Maintainer: Simon Cozens <[EMAIL PROTECTED]>
  Date: 25 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 300
  Version: 1
  Status: Developing

=head1 ABSTRACT

Perl 5.6's C<use bytes> is a useful pragma; this RFC stipulates its
intended behaviour in Perl 6 B<right now>.

=head1 DESCRIPTION

When Perl 5.6 introduced Unicode support, there suddenly became two ways
of handling data: you can handle it as a series of Unicode characters,
or you can manipulate it byte-for-byte. The C<use bytes> pragma was used
to force byte-for-byte manipulation.

The problem with this was mainly of understanding; you had to know when
Perl was handling your data as UTF8 before C<use bytes> made any sense,
and then it looked like you were letting people fiddle with the internal
representation of data, something that should be hidden from them.

This was only B<really> a problem because some data was encoded and some
wasn't. If we're using a consistent Unicode representation internally,
and that's known and documented, the root of the problem goes away; 
since everything's going to be converted to UTF8 when it enters Perl, it
makes sense to want to get at that UTF8. It's considerably less messy
now.

But why? Why would someone want to get at the UTF8? One reason, which I
believe to be sufficient enough to justify the pragma, noted in
"Normalisation and C<unicode::exact>" was to compare representations
without decomposition. C<unicode::exact> gives you non-destructive
comparisons, but it doesn't give you byte-for-byte comparisons. 
C<use bytes> would do this. Other uses which have been noted on p5p
recently include XML and sending data over the network.

However, I want to keep all the Unicode-related pragmata consistently
named, so I suggest that C<use bytes> be renamed to 
C<use unicode::representation>; C<use bytes> appeared to be a rather
confusing name anyway.

As part of the "What does use bytes mean?" discussion/debacle/holy war,
Nick Ing-Simmons expressed the desire for a new pragma which turned off
Unicode altogether. Let's put that in, and call it C<no unicode>. 

=head1 IMPLEMENTATION

The exact semantics of C<no unicode> will be trashed out in "Unicode
Combinatorix"; C<use unicode::representation> can be implemented
similarly to C<use bytes>.

=head1 REFERENCES

L<bytes>

the as-yet-unsubmitted RFCs on 

RFC 295: Normalisation and C<unicode::exact>

RFC ??: When UTF8 Leaks Out

RFC 312: Unicode Combinatorix

RFC 311: Line Disciplines
RFC 300 (v1) C and C

Reply via email to