Re: Strings Manifesto

Jeff Clites Fri, 30 Apr 2004 09:00:31 -0700

On Apr 28, 2004, at 11:25 PM, Leopold Toetsch wrote:

Jeff Clites <[EMAIL PROTECTED]> wrote:

On Apr 28, 2004, at 4:57 AM, Bryan C. Warnock wrote:

Does (that which the masses normally refer to as) binary data
fall inside or outside the scope of a string?

Some languages make this very clear by providing a separate data type
to hold a "blob of bytes".


Back to Parrot, which isn't covered by the manifesto. But anyway we
already need[1] "enum_stringrep_blob" or "_bytes".

Certainly, for the things you've listed under [1] there's no problem with using a separate data type.

I can't imagine that
we'd use a different data type; this would totally mess with Perl
compatibility.

Not necessarily (or, that wasn't my intention). For Ponie, we can do this:

1) Just always implicitly assume "iso-8859-1" when creating strings which Perl5 would have interpreted as binary.

2) To handle certain features of Perl5 semantics, we could set a flag, at the PerlString level, to indicate that it should have Perl5-ish semantics. (That depends on whether a string created in Perl5 code and passed to Perl6 code should act Perl5-ish or Perl6-ish there. That is, are its semantics set by its creation context or its use context?) See below for an example of a case I'm thinking of where the semantics might differ:

We must ensure that such a string is never upscaled to another string representation. We can do all byte-wise operations on such a string, but e.g. appending an utfX string or such should be an error.
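The rule described above (byte-wise operations are fine, appending utfX data is an error) could be sketched in Perl5 roughly as follows. This is only an illustration under stated assumptions: ByteBlob, append and size are hypothetical names, not anything in Parrot or Ponie, and "utfX data" is approximated here as "any string carrying Perl5's internal UTF8 flag".

```perl
use strict;
use warnings;

package ByteBlob;   # hypothetical name, for illustration only

sub new {
    my ($class, $bytes) = @_;
    # Refuse anything that cannot be represented as one byte per character.
    die "not a byte string\n" if $bytes =~ /[^\x00-\xFF]/;
    my $copy = $bytes;
    utf8::downgrade($copy);         # store it in the plain byte representation
    return bless { bytes => $copy }, $class;
}

sub append {
    my ($self, $more) = @_;
    # Assumption: a set UTF8 flag marks the argument as character data.
    # Rejecting it means the blob is never upscaled.
    die "cannot append character data to a byte blob\n"
        if utf8::is_utf8($more);
    $self->{bytes} .= $more;
    return $self;
}

sub size { return CORE::length($_[0]{bytes}) }

package main;

my $blob = ByteBlob->new("\xC8");
$blob->append("\x01\x02");                          # byte-wise append: allowed
my $ok = eval { $blob->append(chr(0x212b)); 1 };    # utf-8 append: dies

print "size=", $blob->size, " utf-8 append ", ($ok ? "allowed" : "rejected"), "\n";
```

The failed append leaves the blob untouched, so the size stays at three bytes.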

Although, Perl5 lets you append a "utf-8" string to a "binary" string. But the behavior is odd. For instance, consider this Perl5 behavior (not sure if it's a feature or a bug):

$a = chr(0xC8);
$b = substr($a.chr(0x212b), 0, 1); # append a "utf-8" character, then pull it off

print $a; # these print....
print $b; # ...the same thing

print lc($a); # these print...
print lc($b); # ...different things

if( $a eq $b ) { print "yes" } # this prints yes

So, in Perl5, not only does the behavior of a (non-utf-8?) string change if it "touches" something utf-8-ish, but it does this despite "eq" telling us the strings are the same. (And, since lc() has no effect on $a, the implication is that the string is sort of half-ASCII-half-binary; that is, case mapping has no effect on characters > 127, which implies they are somehow "uninterpreted"?)

But this behavior could be accommodated (if it's not a bug) at the PerlString level by special-casing the relevant operations for the Ponie case.
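One way to see what differs between the two strings in the example above is to inspect Perl5's internal UTF8 flag with utf8::is_utf8(), a core function that needs no "use utf8". The following is a small sketch, not a statement about how Ponie would implement anything; $x and $y stand in for $a and $b, which are special to sort().

```perl
use strict;
use warnings;

# Same construction as the example above.
my $x = chr(0xC8);
my $y = substr($x . chr(0x212b), 0, 1);   # append a "utf-8" character, then pull it off

# Report the internal representation flag of each string.
my $x_flag = utf8::is_utf8($x) ? 1 : 0;
my $y_flag = utf8::is_utf8($y) ? 1 : 0;

printf "x: flag=%d lc=0x%02X\n", $x_flag, ord(lc($x));
printf "y: flag=%d lc=0x%02X\n", $y_flag, ord(lc($y));
print  "eq: ", ($x eq $y ? "yes" : "no"), "\n";
```

Without a locale or other pragmas in effect, I'd expect $x to be unflagged with lc() leaving 0xC8 alone, and $y to be flagged with lc() giving 0xE8, while "eq" still says yes. So the flag, which "eq" ignores, is the state a PerlString-level special case would have to track.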

The main problem currently seems to be IO, where the best thing would be to move the current hacks into a separate layer above the buffered layer. An additional parameter for open (or layer manipulation features) can select byte-wise IO.

Yes, my intention there was that for read-as-strings, you'd push a string-ification layer onto the stack. For byte-wise IO, you wouldn't.
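As an analogy only (this is Perl5's PerlIO, not Parrot's IO API), the same split can be sketched with Perl5 layers: a handle opened with ":raw" hands back bytes, and pushing an ":encoding(...)" layer on top of it hands back characters. The temp file and variable names are made up for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write three raw bytes: the UTF-8 encoding of U+212B (ANGSTROM SIGN).
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);                          # no translation layers on the write side
print {$out} "\xE2\x84\xAB";
close($out) or die "close: $!";

# Byte-wise IO: no string-ification layer, so the bytes come back as-is.
open(my $raw, '<:raw', $path) or die "open: $!";
my $bytes = do { local $/; <$raw> };
close($raw);

# Read-as-strings: same kind of raw handle, then push a decoding layer.
open(my $txt, '<:raw', $path) or die "open: $!";
binmode($txt, ':encoding(UTF-8)') or die "binmode: $!";
my $chars = do { local $/; <$txt> };
close($txt);

printf "raw: %d bytes; decoded: %d char (U+%04X)\n",
    length($bytes), length($chars), ord($chars);
```

The byte-wise read sees three bytes and the layered read sees a single character, U+212B, from the same file.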

[1]
- transparent IO
  e.g. $ parrot md5sum.imc a.out
- freeze/thaw
- writing packfiles from PASM


Jeff
