On Thu, 21 Aug 2003, Elizabeth Mattijsen wrote:

> At 14:15 +0100 8/21/03, Nicholas Clark wrote:
> >On Wed, Aug 20, 2003 at 07:19:42PM -0400, Benjamin Goldberg wrote:
> >  > Leopold Toetsch wrote:
> >  > > But these could be converted to utf32 as soon as they are seen.
> >  > For a long string, that could be quite a bit of bloat.
> >Jarkko's view is that the combined hit of the size of the extra code to skip
> >along the variable length encoding, the time taken to execute that code,
> >(and I guess the cache misses it creates) is greater than the gain from
> >saving space.
> 
> Indeed.  I think available memory has increased more than 4 fold 
> since the first regexp engine that could only do 1-byte ASCII.  So 
> relatively, I don't think that bloat is an issue.  Just don't do 
> regexps on 256Mbyte strings when your machine has less than 1 GByte 
> RAM   ;-)

FWIW, we're not going to do string ops on UTF-8 stuff. We'll understand 
it, and know how to translate it to more useful forms, but it's just a 
static storage format for us. (Mainly because, while working with UTF-8 
strings is a massive pain, it's foolish to transform it to UTF-16 or 
UTF-32 if we don't need to.) Our Unicode operations will be done either 
on UTF-16 (if we get ICU going, since that's what it uses) or UTF-32. 
UTF-8 is a legacy/storage format only, as far as we're concerned.

The same thing goes for other variable-width encodings such as Shift-JIS, 
FWIW.
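
To make the trade-off concrete, here's a minimal Python sketch (not Parrot 
code; all the function names are made up for illustration). Finding the 
Nth character in UTF-8 means walking every sequence before it, while a 
fixed-width UTF-32 buffer lets you jump straight to byte offset 4*N, at 
the cost of a bigger buffer.

```python
import struct


def utf8_seq_len(lead):
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte: 0x%02x" % lead)


def utf8_char_at(data, n):
    """Code point number n of UTF-8 bytes: must skip n sequences first (O(n))."""
    i = 0
    for _ in range(n):
        i += utf8_seq_len(data[i])
    return ord(data[i:i + utf8_seq_len(data[i])].decode("utf-8"))


def to_utf32(data):
    """Translate the UTF-8 storage form to a fixed-width working form."""
    return data.decode("utf-8").encode("utf-32-le")


def utf32_char_at(buf, n):
    """Code point number n of a UTF-32-LE buffer: one offset computation (O(1))."""
    return struct.unpack_from("<I", buf, 4 * n)[0]


if __name__ == "__main__":
    stored = "na\u00efve \u65e5\u672c\u8a9e".encode("utf-8")
    working = to_utf32(stored)
    print(len(stored), "bytes stored,", len(working), "bytes in working form")
    print(hex(utf8_char_at(stored, 6)), hex(utf32_char_at(working, 6)))
```

Both lookups return the same code point; the only difference is how much 
work it takes to get there and how much memory the buffer occupies.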

                                        Dan
