Re: Bit ops on strings

Nicholas Clark Wed, 26 May 2004 02:02:55 -0700

On Tue, May 25, 2004 at 07:48:32PM -0700, Jeff Clites wrote:
> On May 25, 2004, at 12:26 PM, Dan Sugalski wrote:


> >Yup. UTF8 is Just another variable-width encoding. Do anything with it 
> >and we convert it to a fixed-width encoding, in this case UTF32.
> 
> This has the unfortunate side-effect of wasting 50-75% of the storage 
> space in the common cases, of course.

True. But variable-length encodings suck performance-wise. Jarkko wrote a
caching layer for perl 5.8.1 that stores pairs of UTF8 character offsets and
byte offsets. Even though it adds a lot of complexity, and usually only one
pair is cached, the feeling was that it accelerated some operations by a
factor of 10. It seems that the O(n) cost of random access hurts much more
than the memory usage. But you can't win.
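
To make the caching idea concrete, here is a rough Python sketch. It is not
the perl 5.8.1 code; the class and method names (Utf8String, byte_offset,
char_at) are made up for illustration, and it assumes the bytes are already
valid UTF8. It remembers one (character index, byte offset) pair, so a lookup
at or after the last position resumes from there instead of rescanning from
byte 0.

```python
class Utf8String:
    """UTF-8 bytes plus one cached (char index, byte offset) pair.

    Hypothetical illustration only; assumes `data` is valid UTF-8.
    """

    def __init__(self, data: bytes):
        self.data = data
        self._cached = (0, 0)  # character 0 always starts at byte 0

    def byte_offset(self, char_index: int) -> int:
        """Return the byte offset at which character `char_index` starts."""
        cached_char, cached_byte = self._cached
        # Resume from the cached pair if it lies at or before the target;
        # otherwise fall back to a scan from the start of the string.
        if cached_char <= char_index:
            ci, bi = cached_char, cached_byte
        else:
            ci, bi = 0, 0
        data = self.data
        while ci < char_index:
            if bi >= len(data):
                raise IndexError("character index out of range")
            bi += 1
            # Skip continuation bytes, which all look like 10xxxxxx.
            while bi < len(data) and (data[bi] & 0xC0) == 0x80:
                bi += 1
            ci += 1
        self._cached = (ci, bi)
        return bi

    def char_at(self, char_index: int) -> str:
        """Return the single character at `char_index`."""
        data = self.data
        start = self.byte_offset(char_index)
        if start >= len(data):
            raise IndexError("character index out of range")
        end = start + 1
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end += 1
        return data[start:end].decode("utf-8")
```

With a single pair, a forward walk over the string costs O(1) per step rather
than O(n), which is the common access pattern. A lookup before the cached
position still pays for a scan from the start in this sketch.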

Jarkko's view was that if he were to implement Unicode in perl5 again, he'd
go for a fixed-width internal representation, UCS-4/UTF-32 (IIRC).

The only thing that might be useful to cache on a UTF8 string is the highest
code point seen, so that we know whether to unpack to 8, 16 or 32 bit without
a scan. Presumably we can find this during input validation, at the
"conversion" from binary to UTF8.
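
As a rough sketch of that idea in Python (the function names
validate_and_width and unpack_fixed_width are invented here, and Python's
own decoder stands in for the validation pass): validate once, record the
highest code point, and pick the narrowest fixed width that holds it.

```python
from typing import Tuple

# Fixed-width target encodings keyed by bytes per character.
# utf-16-le is only fixed width here because it is chosen solely when
# the highest code point is <= 0xFFFF, so no surrogate pairs occur.
_CODECS = {1: "latin-1", 2: "utf-16-le", 4: "utf-32-le"}


def validate_and_width(data: bytes) -> Tuple[int, int]:
    """Validate UTF-8 and return (highest code point, bytes per char).

    Raises UnicodeDecodeError if `data` is not valid UTF-8.
    """
    text = data.decode("utf-8")  # the validation pass
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        width = 1
    elif highest <= 0xFFFF:
        width = 2
    else:
        width = 4
    return highest, width


def unpack_fixed_width(data: bytes, width: int) -> bytes:
    """Unpack UTF-8 to `width` bytes per character, using a cached width."""
    return data.decode("utf-8").encode(_CODECS[width])
```

The point is that the (highest, width) result would be stored alongside the
string at validation time, so a later unpack can pick its target width
without scanning the data again.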

Nicholas Clark
