Re: String representation

Kai Henningsen Mon, 18 Dec 2000 14:18:49 -0800
[EMAIL PROTECTED] (Jarkko Hietaniemi)  wrote on 15.12.00 in 
<[EMAIL PROTECTED]>:

> On Fri, Dec 15, 2000 at 12:13:01PM +0000, Simon Cozens wrote:
> > IMHO, the first thing we need to design and code is the API and runtime
> > library, since everything else builds on top of that, and we can design
> > other stuff in parallel with coding it. (A lot of it will be grunt work.)
> >
> > So, before we start even thinking about what we need, it's time to look at
> > the vexed question of string representation. How do we do Unicode without
> > getting into the horrendous non-Latin1 cockups we're seeing on p5p right
> > now? Larry
>
> As painful as it may sound (codingwise) I would urge to spare some
> thought to using (internally) UTF-32 for those encodings for which
> UTF-8 would be *longer* than the UTF-32 (mainly the Asian scripts).

No such animal.

The highest code that will ever be allocated still fits into 21 bits (actually,
20 bits + 64K, i.e. up to 0x10FFFF); that's a design decision (which means that
UTF-16 works forever). Since a multi-byte UTF-8 sequence of n bytes has n*5+1
bits of usable data, 21 bits means 4 bytes - which is the same size as UTF-32.
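
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the n*5+1 rule. The function names utf8_payload_bits and utf8_length are made up for illustration and are not part of any proposed API; the sketch assumes the 4-byte cap that follows from the 0x10FFFF limit.

```python
MAX_CODE_POINT = 0x10FFFF  # 2**20 + 2**16 - 1, the "20 bits + 64K" limit


def utf8_payload_bits(n: int) -> int:
    """Usable data bits in a UTF-8 sequence of n bytes (hypothetical helper)."""
    if n == 1:
        return 7  # single byte: 0xxxxxxx
    # The lead byte carries 7 - n bits and each of the n - 1 continuation
    # bytes carries 6 bits, so (7 - n) + 6 * (n - 1) = 5 * n + 1.
    return 5 * n + 1


def utf8_length(code_point: int) -> int:
    """Bytes needed to encode one code point in UTF-8 (hypothetical helper)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point out of range")
    for n in range(1, 5):
        if code_point < (1 << utf8_payload_bits(n)):
            return n
    raise AssertionError("unreachable: 21 bits always fit in 4 bytes")


if __name__ == "__main__":
    for cp in (0x41, 0x7FF, 0xFFFF, 0x10000, MAX_CODE_POINT):
        print(f"U+{cp:04X}: {utf8_length(cp)} byte(s) in UTF-8, 4 in UTF-32")
```

The point is only that no code point up to 0x10FFFF needs more than 4 bytes, so UTF-8 is never longer than UTF-32.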

If your string has even one character that fits in three bytes (3*5+1 = 16
bits, i.e. the "old" Unicode allocations, which include *all* allocations as of
Unicode 3.0), it's shorter with UTF-8.
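
A quick way to see this is to compare encoded sizes directly. The sketch below uses Python's built-in codecs; the helper name encoded_sizes and the sample strings are my own illustrations, and "utf-32-be" is chosen only so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-32 size) in bytes for text (hypothetical helper)."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-be"))


if __name__ == "__main__":
    samples = {
        "ASCII only": "abc",
        "three-byte characters only": "\u65e5\u672c\u8a9e",
        "four-byte characters only": "\U00010348\U00010348",
        "four-byte plus one ASCII": "\U00010348\U00010348a",
    }
    for label, text in samples.items():
        utf8, utf32 = encoded_sizes(text)
        print(f"{label}: UTF-8 = {utf8} bytes, UTF-32 = {utf32} bytes")
```

Only a string made up entirely of four-byte characters ties; every other case comes out shorter in UTF-8.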

Of course, it's *faster* with UTF-32 as long as you don't bust your cache.

Regards, Kai