On Thu, 21 Aug 2003, Elizabeth Mattijsen wrote:

> At 14:15 +0100 8/21/03, Nicholas Clark wrote:
> >On Wed, Aug 20, 2003 at 07:19:42PM -0400, Benjamin Goldberg wrote:
> > > Leopold Toetsch wrote:
> > > > But these could be converted to utf32 as soon as they are seen.
> > > For a long string, that could be quite a bit of bloat.
> >Jarkko's view is that the combined hit of the size of the extra code to skip
> >along the variable length encoding, the time taken to execute that code,
> >(and I guess the cache misses it creates) is greater than the gain from
> >saving space.
>
> Indeed. I think available memory has increased more than 4 fold
> since the first regexp engine that could only do 1-byte ASCII. So
> relatively, I don't think that bloat is an issue. Just don't do
> regexps on 256Mbyte strings when your machine has less than 1 GByte
> RAM ;-)

FWIW, we're not going to do string ops on UTF-8 stuff. We'll understand
it, and know how to translate it to more useful forms, but it's just a
static storage format for us. (Mainly because, while working with UTF-8
strings is a massive pain, it's foolish to transform them to UTF-16 or
UTF-32 if we don't need to.)

Our Unicode operations will be done either on UTF-16 (if we get ICU
going, since that's what it uses) or UTF-32. UTF-8 is a legacy/storage
format only, so far as we're concerned. The same thing goes for other
variable-width encodings such as Shift-JIS, FWIW.

					Dan
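
To make the variable-width versus fixed-width trade-off in this thread concrete, here is a minimal sketch in Python. It is an illustration only, not Parrot's code: the helper names `utf8_seq_len`, `utf8_char_at` and `utf32_char_at` are hypothetical, and it assumes well-formed input. Finding the Nth character in UTF-8 means walking every earlier sequence, while in UTF-32 it is a single offset calculation, at the cost of four bytes per character.

```python
import struct


def utf8_seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_char_at(data: bytes, index: int) -> str:
    """Variable width: skip over `index` sequences from the start, so O(n)."""
    pos = 0
    for _ in range(index):
        pos += utf8_seq_len(data[pos])
    return data[pos:pos + utf8_seq_len(data[pos])].decode("utf-8")


def utf32_char_at(data: bytes, index: int) -> str:
    """Fixed width: the code point sits at byte offset 4 * index, so O(1)."""
    (code_point,) = struct.unpack_from("<I", data, 4 * index)
    return chr(code_point)


text = "naïve 日本語"
utf8 = text.encode("utf-8")        # compact, variable-width storage form
utf32 = text.encode("utf-32-le")   # 4 bytes per character, no BOM

print(len(utf8), len(utf32))
print(utf8_char_at(utf8, 6), utf32_char_at(utf32, 6))
```

The size difference printed on the first line is the "bloat" discussed above; the scan loop in `utf8_char_at` is the extra code and time that the fixed-width form avoids.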