Re: [svn:perl6-synopsis] r14489 - doc/trunk/design/syn

Juerd Waalboer Thu, 10 Jan 2008 20:38:17 -0800

[EMAIL PROTECTED] skribis 2008-01-10 16:14 (-0800):
> +(Note that C<.bytes> is not guaranteed to be well-defined when the encoding
> +is unknown.


(This message is a mess; in my defense, it's 5:30 AM here. I just had to
respond, because I have the feeling Perl 6's Unicode model is going
in exactly the wrong direction, if I interpret this diff correctly.)

What if the encoding is known, but accessing the string at the .bytes
level breaks its consistency?

Rather than a scheme where Unicode text strings have an encoding
property, I think a scheme where Unicode text strings are just Unicode
text strings is better: the *binary* strings can have an encoding
property.

So:

* A Str is a sequence of codepoints, and provides graphemes/glyphs if
  requested. It does not have bytes, and the internal encoding does not
  show except through introspection. The internal encoding can
  theoretically change at Perl's will.
* A Buf is a sequence of bytes, not codepoints or characters of any
  kind.
* A Buf with a defined .encoding:
  - does Str, with transparent decoding (with validity checking)
  - also, transparent encoding

my Str $foo = "H€łłø wöŕłđ";
my Buf $bar;
$bar.encoding = "utf-8";  # or however an encoding is declared
$bar = $foo;  # gets utf-8 encoded
$bar.bytes;   # [ "H", "\xE2", "\x82", "\xAC", ... ]
$bar.codes;   # [ "H", "€", "ł", ... ]
$foo.codes eqv $bar.codes;  # true
$foo.bytes;   # Huh? What? Makes no sense -> fail
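
For comparison, here is a rough sketch of the same separation in Python
rather than Perl 6, since Python's str and bytes types happen to be split
the way I'm proposing for Str and Buf. The variable names are only
illustrative, and the encoding is passed explicitly instead of being a
property of the buffer, so this is an analogy and not the proposed API.

```python
# Python analogy (not Perl 6): str plays the role of Str, bytes the role of Buf.
text = "H€łłø wöŕłđ"            # text string: a sequence of codepoints
buf = text.encode("utf-8")      # byte string, produced by an explicit encoding step

first_bytes = buf[:4]           # b"H\xe2\x82\xac": "H" followed by the three bytes of "€"
codes = list(buf.decode("utf-8"))   # decoding yields the codepoints again

same_codes = codes == list(text)    # both sides agree at the codepoint level

# The text string offers no byte-level view of its own: there is nothing
# to decode, and the only way to get bytes is to pick an encoding.
has_byte_view = hasattr(text, "decode")
```

The internal representation of the text string never shows up here; bytes
exist only once an encoding has been chosen.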

All byte-oriented mechanisms can have an encoding defined somehow:
filehandles, environment variables, Bufs, system call wrappers.

I think that would work much more easily than giving Strs encoding
properties. When writing to a file, or a Buf, you're probably using a
lot of Strs, and it would make no sense to have them all encode
differently. Likewise, when reading from IO, a Buf, or anything
byte-oriented, the encoding will have to be known to decode it.

I fail to see how giving a Str any .bytes or .encoding might make sense:
these belong to byte strings, not text strings.

Making it easy to work with the internal byte buffer will take away
means of optimization, the ease of changing our mind about the best
implementation encoding, and either security or performance: either you
check the consistency every time you do something and everything is
slow, or you don't, and everything is potentially insecure when passed
on to other code. Of course, the current internal encoding and byte
buffer should be accessible somehow, and maybe even writable for the
brave of heart, but IMO certainly not through the far too encouraging
.bytes - I'm tempted to call for .HOW.internal.

I think that a Buf with a defined encoding should check whether the data
is valid when reading, but a Str can skip this: Perl itself put the data
there, and trusts its own routines for much better performance.
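
A minimal Python sketch of checking validity at the decoding boundary
(an analogy, not Perl 6; decode_checked is a hypothetical helper name):

```python
def decode_checked(data: bytes, encoding: str = "utf-8") -> str:
    """Decode a byte buffer, rejecting malformed input.

    bytes.decode is strict by default and raises UnicodeDecodeError
    when the data is not valid in the given encoding.
    """
    return data.decode(encoding)


valid = b"H\xe2\x82\xac"     # "H" followed by a complete euro sign
truncated = b"H\xe2\x82"     # the euro sign is missing its last byte

decoded = decode_checked(valid)

try:
    decode_checked(truncated)
    rejected = False
except UnicodeDecodeError:
    rejected = True          # malformed bytes never become a text string
```

Once the data is a text string, no further checking is needed, because
only validated data can get there.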

Please, don't give Strs any byte semantics, but do give Bufs support for
transparent en-/decoding, and perhaps even Unicode semantics.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <[EMAIL PROTECTED]>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <[EMAIL PROTECTED]>
