[EMAIL PROTECTED] wrote 2008-01-10 16:14 (-0800):
> +(Note that C<.bytes> is not guaranteed to be well-defined when the encoding
> +is unknown.

(This message is a mess; in my defense, it's 5:30 AM here. I just had to
respond, because I have the feeling Perl 6's Unicode model is going
exactly the wrong way if I interpret this diff correctly.)

What if the encoding is known, but by accessing the .bytes level one
breaks the consistency?

Rather than a scheme where Unicode text strings have an encoding
property, I think a scheme where Unicode text strings are just Unicode
text strings is better: the *binary* strings can have an encoding
property.

So:

* A Str is a sequence of codepoints, and provides graphemes/glyphs if
  requested. It does not have bytes, and the internal encoding does not
  show except through introspection. The internal encoding can
  theoretically change at Perl's will.

* A Buf is a sequence of bytes, not codepoints or characters of any
  kind.

* A Buf with a defined .encoding:
  - does Str, with transparent decoding (with validity checking)
  - also, transparent encoding

    my Str $foo = "H€łłø wöŕłđ";
    my Buf $bar;
    $bar.encoding = "utf-8";  # or however a decoding is declared
    $bar = $foo;              # gets utf-8 encoded

    $bar.bytes;  # [ "H", "\xE2", "\x82", "\xAC", ... ]
    $bar.codes;  # [ "H", "€", "ł", ... ]

    $foo.codes eqv $bar.codes  # true

    $foo.bytes;  # Huh? What? Makes no sense -> fail

All byte-oriented mechanisms can have an encoding defined somehow:
filehandles, environment variables, Bufs, system call wrappers.

I think that would be much easier to work with than giving Strs encoding
properties. When writing to a file, or a Buf, you're probably using a
lot of Strs, and it would make no sense to have them all encode
differently. Likewise, when reading from IO, a Buf, or anything
byte-oriented, the encoding will have to be known to decode it.

I fail to see how giving a Str any .bytes or .encoding might make sense:
these belong to byte strings, not text strings.
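
As a rough illustration of the split described above, here is a minimal
sketch in Python, used only as an analogy: it is not Perl 6 and not part
of the proposal. The class name EncodedBuf and its methods assign,
byte_list and codes are made up for this example. Python's built-in str
plays the role of Str, and the wrapper plays the role of a Buf with a
defined .encoding.

```python
class EncodedBuf:
    """Byte buffer that knows its encoding (hypothetical, for illustration)."""

    def __init__(self, encoding, data=b""):
        self.encoding = encoding
        self.data = data

    def assign(self, text):
        # Assigning text encodes it transparently into bytes.
        self.data = text.encode(self.encoding)

    def byte_list(self):
        # The byte-level view lives on the buffer, not on the text string.
        return list(self.data)

    def codes(self):
        # Decoding validates the bytes; malformed input raises
        # UnicodeDecodeError instead of yielding broken text.
        return list(self.data.decode(self.encoding))


foo = "H€łłø wöŕłđ"           # text: a sequence of codepoints
bar = EncodedBuf("utf-8")
bar.assign(foo)                # gets utf-8 encoded

print(bar.byte_list()[:4])     # [72, 226, 130, 172]
print(bar.codes()[:3])         # ['H', '€', 'ł']
print(list(foo) == bar.codes())  # True

# The text string itself offers no byte view.
print(hasattr(foo, "bytes"))   # False

# Validity checking happens where bytes are read as text:
# this buffer ends in the middle of a multi-byte sequence.
bad = EncodedBuf("utf-8", b"H\xe2\x82")
rejected = False
try:
    bad.codes()
except UnicodeDecodeError:
    rejected = True
print(rejected)                # True
```

The text string never exposes its storage, so the check for malformed
data is needed only at the boundary where bytes are decoded.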

Making it easy to work with the internal byte buffer will take away
means of optimization, ease of changing our mind about the best
implementation encoding, and either security or performance: either you
check the consistency every time you do something and everything is
slow, or you don't, and everything is potentially insecure when passed
on to other code.

Of course, the current internal encoding and byte buffer should be
accessible somehow, and maybe even writable for the brave of heart, but
IMO certainly not with the way-too-encouraging .bytes thing - I'm
tempted to call for .HOW.internal.

I think that a Buf with a defined encoding should check whether the data
is valid when reading, but a Str can skip this: Perl itself put the data
there, and trusts its own routines for much better performance.

Please, don't give Strs any byte semantics, but do give Bufs support for
transparent en-/decoding, and perhaps even Unicode semantics.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <[EMAIL PROTECTED]>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <[EMAIL PROTECTED]>