On Sun, May 02, 2004 at 11:37:31AM -0700, Jeff Clites wrote:
> Two more things to keep in mind:
> 
> On May 1, 2004, at 4:54 PM, Aaron Sherman wrote:
> 
> >If Perl defaults to UTF-8
> 
> People need to realize also that although UTF-8 is a pretty good 
> interchange format, it's a really bad in-memory representation. This is 
> for at least 2 related reasons: (1) To get to the N-th logical 
> character, you need to start all the way at the beginning and scan 
> forward, so your access time is O(N). For instance, the 1000th 
> character of a string might start anywhere between byte 1000 and byte 
> 4000. (2) Even once you've located the right byte position, you need to 
> do computational work to unwind the bytes into the value they 
> represent. A third reason is that Japanese text will take up three 
> bytes per character in UTF-8 (versus two in UTF-16).
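
(To make reason (1) concrete for anyone following along: below is roughly
what "scan from the beginning" means in C. It's only a sketch -- the
function name is invented and it assumes the buffer is already valid
UTF-8 -- but it shows why indexing is O(N): every preceding byte has to
be examined, to skip the 10xxxxxx continuation bytes.)

    #include <stddef.h>

    /* Sketch: find the start of the n-th (0-based) code point in a
     * valid UTF-8 buffer by walking over every earlier character. */
    const unsigned char *utf8_index(const unsigned char *s, size_t len,
                                    size_t n)
    {
        const unsigned char *p = s, *end = s + len;
        while (n > 0 && p < end) {
            p++;                                  /* skip the lead byte   */
            while (p < end && (*p & 0xC0) == 0x80)
                p++;                              /* ...and continuations */
            n--;
        }
        return p < end ? p : NULL;                /* NULL: out of range   */
    }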

> I'm not sure it's relevant, since I think Dan's completely changing 
> everything, but my original intention was that rep_one v. two v. four 
> were just different ways of storing integers in optimally compact ways. 
> There wasn't supposed to be any externally-visible behavior difference 
> between them. For instance, you might end up with something in rep_four 
> which could have been represented in rep_one--if so, you'd be wasting a 
> bit of memory, but you should never be able to tell, in terms of the 
> API. With rep_one, you _know_ all of the numbers in your list have to 
> be < 256; with rep_four, they might be, but you'd have to check (and if 
> you check, you should probably downsize to rep_one). Those three 
> representation choices were just a space-based optimization--they 
> weren't supposed to lead to different behaviors.
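
(As I read it, the "check and downsize" step would look something like
the following. This is just a sketch with invented names -- I don't know
what the real rep_one/rep_four internals look like:)

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical: a rep_four buffer whose values all happen to fit
     * in a byte can be repacked as rep_one, with no API-visible
     * difference, just a smaller footprint. */
    static int fits_in_rep_one(const uint32_t *buf, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            if (buf[i] > 0xFF)
                return 0;              /* needs rep_two or rep_four */
        return 1;                      /* every value < 256 */
    }

    static void downsize_to_rep_one(const uint32_t *src, uint8_t *dst,
                                    size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            dst[i] = (uint8_t)src[i];  /* lossless: range was checked */
    }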

I may be misremembering what I've read here, but I thought that Dan said
that for variable-length encodings (such as Shift-JIS) parrot would store
each character in memory as a constant-size 16 or 32 bit integer, rather
than as the (external) variable-length byte sequence, as this gives O(1)
random access and avoids much coding pain.
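
(That is, as I understand it -- and with invented names -- something like
packing each one- or two-byte Shift-JIS character into a fixed-size cell,
so that finding the n-th character is a plain array index:)

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: each Shift-JIS character, single- or double-byte, gets
     * one 16-bit cell, so lookup is O(1) with no scanning. */
    static uint16_t sjis_cell(const unsigned char *p, int is_double)
    {
        return is_double ? (uint16_t)((p[0] << 8) | p[1]) : p[0];
    }

    static uint16_t nth_char(const uint16_t *cells, size_t n)
    {
        return cells[n];
    }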

He made no explicit comment about UTF-8 (just another variable-length
encoding), which would imply that parrot will be storing UTF-8 in this
way too. However, I was assuming that internally the UTF-8 will immediately
get converted to UTF-32BE or UTF-32LE (or 16 bit or 8 bit if possible),
as this avoids worrying about code points above 0x1FFFFF cropping up in
UTF-8 (which, if I have it right, need 5 or 6 bytes, so would bust Dan's
scheme). This is on the assumption that UTF-8 input will be validity
checked at input time anyway (at least enough to do the code point
splitting), so writing the values out as 32 bit quantities there and then
will take virtually no more CPU, but save lots later.
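
(By "code point splitting" I mean roughly the pass below, done once when
the data comes in. It's a sketch only: a real decoder would also have to
reject overlong forms, surrogates, and the other validity failures.)

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: decode UTF-8 into 32 bit code points at input time.
     * Returns the number of code points written, or -1 on a malformed
     * sequence.  One could also track the maximum value seen here, to
     * pick 8, 16 or 32 bit storage on the spot. */
    static long utf8_to_utf32(const unsigned char *in, size_t len,
                              uint32_t *out)
    {
        size_t i = 0;
        long n = 0;
        while (i < len) {
            unsigned char b = in[i++];
            uint32_t cp;
            int extra;
            if      (b < 0x80)           { cp = b;        extra = 0; }
            else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
            else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
            else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
            else return -1;   /* stray continuation, or a 5/6-byte form */
            if (i + extra > len)
                return -1;    /* truncated sequence */
            while (extra--) {
                unsigned char c = in[i++];
                if ((c & 0xC0) != 0x80)
                    return -1;              /* not a continuation byte */
                cp = (cp << 6) | (c & 0x3F);
            }
            out[n++] = cp;
        }
        return n;
    }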

Or is this now all gone, because there will be Unicode everywhere and
strings will get converted at input I/O time?

Nicholas Clark
