Re: Bit ops on strings

Jeff Clites Sun, 02 May 2004 11:37:54 -0700

Two more things to keep in mind:

On May 1, 2004, at 4:54 PM, Aaron Sherman wrote:

If Perl defaults to UTF-8

People also need to realize that although UTF-8 is a pretty good interchange format, it's a really bad in-memory representation. This is for at least two related reasons: (1) To get to the N-th logical character, you need to start all the way at the beginning and scan forward, so your access time is O(N). For instance, the 1000th character of a string might start anywhere between roughly byte 1000 and byte 4000, depending on how many multi-byte characters come before it. (2) Even once you've located the right byte position, you need to do computational work to unwind the bytes into the value they represent. A third reason is that most Japanese text will take up three bytes per character in UTF-8, versus two in UTF-16.
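To make points (1) and (2) concrete, here is a minimal Python sketch of what indexing into UTF-8 involves. The function names utf8_char_offset and utf8_decode_at are made up for illustration and are not Parrot API; the code assumes well-formed UTF-8 and does no validation.

```python
def utf8_char_offset(data: bytes, n: int) -> int:
    """Return the byte offset where the n-th (0-based) character starts.

    Hypothetical helper: it has to walk from the start, so it is O(n).
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts a character.
        if (byte & 0xC0) != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("string has fewer than n + 1 characters")


def utf8_decode_at(data: bytes, offset: int) -> int:
    """Decode the code point whose lead byte sits at `offset` (no validation)."""
    lead = data[offset]
    if lead < 0x80:                  # 0xxxxxxx: one byte, nothing to unwind
        return lead
    if lead >> 5 == 0b110:           # 110xxxxx: one continuation byte follows
        extra, value = 1, lead & 0x1F
    elif lead >> 4 == 0b1110:        # 1110xxxx: two continuation bytes follow
        extra, value = 2, lead & 0x0F
    elif lead >> 3 == 0b11110:       # 11110xxx: three continuation bytes follow
        extra, value = 3, lead & 0x07
    else:
        raise ValueError("offset does not point at a lead byte")
    for byte in data[offset + 1:offset + 1 + extra]:
        value = (value << 6) | (byte & 0x3F)   # each continuation byte adds 6 bits
    return value
```

With this sketch, the 1000th character of an all-ASCII string starts at byte offset 999, while the 1000th character of a string made only of four-byte characters starts at offset 3996. A fixed-width representation gets the same answer with one multiplication.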

I still think that's ok, and better than
representation-expanding to the larger representation and doing the
bit-op in that, since that means that bit-vectors would have to be
valid in enum_stringrep_one, _two and _four as sort of alternate
datastructures. I don't think we want to go there.

I'm not sure it's relevant, since I think Dan's completely changing everything, but my original intention was that rep_one vs. two vs. four were just different ways of storing integers in optimally compact ways. There wasn't supposed to be any externally visible behavior difference between them. For instance, you might end up with something in rep_four which could have been represented in rep_one. If so, you'd be wasting a bit of memory, but you should never be able to tell, in terms of the API. With rep_one, you _know_ all of the numbers in your list have to be < 256; with rep_four, they might be, but you'd have to check (and if you check, you should probably downsize to rep_one). Those three representation choices were just a space-based optimization; they weren't supposed to lead to different behaviors.
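As a rough illustration of that intent (not the actual Parrot code), here is a Python sketch in which the storage width is purely an internal detail. The names smallest_rep, IntString and downsized are hypothetical, and the little-endian packing is just an assumption made for the example.

```python
def smallest_rep(values):
    """Return 1, 2 or 4: the narrowest byte width that can hold every value."""
    top = max(values, default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4


class IntString:
    """Hypothetical sequence of integers stored at 1, 2 or 4 bytes per element."""

    def __init__(self, values, width=None):
        values = list(values)
        needed = smallest_rep(values)
        self.width = needed if width is None else width
        if self.width < needed:
            raise ValueError("width too small for the given values")
        # Pack every element at the chosen fixed width.
        self._data = b"".join(v.to_bytes(self.width, "little") for v in values)

    def __len__(self):
        return len(self._data) // self.width

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start = i * self.width          # fixed width, so indexing is O(1)
        return int.from_bytes(self._data[start:start + self.width], "little")

    def __eq__(self, other):
        # Equality looks only at the values, never at the storage width.
        return isinstance(other, IntString) and list(self) == list(other)

    def downsized(self):
        """Return an equal IntString stored at the narrowest possible width."""
        return IntString(list(self))
```

Here IntString([72, 105], width=4) and IntString([72, 105]) compare equal and index identically; the first just uses 8 bytes of storage where the second uses 2, and calling downsized() on it recovers the compact form.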

If this continues to be so contentious, I'm tempted to agree with the
nay-sayers and say that Parrot shouldn't do bit-vectors on strings, and
we should just implement a bit-vector class later on.

Yes, that does have the benefit of clarity/simplicity/non-contentiousness.

Perl will just have to suffer the overhead of translation.

Yep, though actually there's no reason why there couldn't be two distinct PMCs, which just happen to look the same from a Perl5 point of view. There wouldn't have to be a translation overhead, necessarily (at least, in cases where you don't do both binary-ish and text-ish operations on the same scalar).

It tends to be easier to have a distinction, and pretend it's not there in certain circumstances, than to lack a distinction, and try to make everything work out sensibly (text operations on binary data, binary operations on textual data).
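A small Python sketch of what I mean, with the caveat that everything here is illustrative: the Scalar class, its conversions counter and the bit_or function are invented names, UTF-8 is an assumed encoding for crossing between the two views, and the zero-padding of the shorter operand follows what Perl 5's string | does.

```python
from itertools import zip_longest


class Scalar:
    """Hypothetical scalar that holds either text (str) or binary data (bytes)."""

    def __init__(self, value):
        if not isinstance(value, (str, bytes)):
            raise TypeError("expected str or bytes")
        self.value = value
        self.conversions = 0    # counts how often we had to translate

    def as_bytes(self):
        # Translation happens only if the scalar currently holds text.
        if isinstance(self.value, str):
            self.value = self.value.encode("utf-8")
            self.conversions += 1
        return self.value

    def as_text(self):
        # Translation happens only if the scalar currently holds binary data.
        if isinstance(self.value, bytes):
            self.value = self.value.decode("utf-8")
            self.conversions += 1
        return self.value


def bit_or(a, b):
    """Bytewise OR of two scalars, zero-padding the shorter operand."""
    pairs = zip_longest(a.as_bytes(), b.as_bytes(), fillvalue=0)
    return Scalar(bytes(x | y for x, y in pairs))
```

A scalar that only ever sees bit ops stays binary and never pays for a conversion, and likewise for one that only sees text ops. The cost shows up only when the same scalar is used both ways, which is the case where the distinction matters anyway.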

JEff
