Nicholas Clark wrote:
> 
> On Wed, Aug 20, 2003 at 07:19:42PM -0400, Benjamin Goldberg wrote:
> 
> > Leopold Toetsch wrote:
> 
> > > But these could be converted to utf32 as soon as they are seen.
> >
> > For a long string, that could be quite a bit of bloat.
> 
> Jarkko's view is that the combined hit of the size of the extra code to
> skip along the variable length encoding,

We've already got code to skip along a variable length encoding
(skip_forward does precisely this).

> the time taken to execute that code,

With the current code, where string_ functions take indices, not
iterators, this skipping forward already needs to be done.  The only
difference is that it's done inside each and every string_ function,
instead of once in a separate string_convert_index_to_iterator function.
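
To make that concrete, here is a rough sketch of the split I have in
mind.  It is not Parrot's actual code: utf8_iter, utf8_seq_len,
utf8_index_to_iter and utf8_iter_next are names I made up for
illustration, and the sketch assumes well-formed UTF-8.  The O(n) skip
happens once, in the converter; everything after that is O(1) per
character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iterator: a byte offset into a UTF-8 buffer. */
typedef struct {
    const unsigned char *buf;  /* start of the UTF-8 data */
    size_t len;                /* length of the data in bytes */
    size_t pos;                /* byte offset of the current character */
} utf8_iter;

/* Bytes in the sequence that starts with lead byte b.
   Assumes well-formed UTF-8 (b is never a continuation byte). */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* The one place that pays the O(n) cost: walk char_index characters
   from the start of the buffer and remember where we ended up. */
static utf8_iter utf8_index_to_iter(const unsigned char *buf, size_t len,
                                    size_t char_index)
{
    utf8_iter it = { buf, len, 0 };
    while (char_index > 0 && it.pos < len) {
        it.pos += utf8_seq_len(buf[it.pos]);
        char_index--;
    }
    return it;
}

/* Decode the character under the iterator and step past it.
   Touches at most four bytes, so it is O(1). */
static uint32_t utf8_iter_next(utf8_iter *it)
{
    static const unsigned char lead_mask[5] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    const unsigned char *p;
    size_t n, i;
    uint32_t cp;

    assert(it->pos < it->len);
    p = it->buf + it->pos;
    n = utf8_seq_len(p[0]);
    assert(it->pos + n <= it->len);

    cp = p[0] & lead_mask[n];
    for (i = 1; i < n; i++)
        cp = (cp << 6) | (p[i] & 0x3F);
    it->pos += n;
    return cp;
}
```

A caller that wants characters 2, 3, 4, ... converts the index 2 once
and then just calls utf8_iter_next repeatedly, rather than re-skipping
from the start of the string for every index.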

> (and I guess the cache misses it creates)

If we're converting indices to iterators often, then the converter will
remain in cache.  If we're smart enough to convert once, and then do
everything relative to that iterator, then the cost of the cache-miss to
load the convert function into memory will be relatively minor.

> is greater than the gain from saving space.

How much gain there is in space by keeping data in utf8, I don't know. 
This would have to be determined by examining samples of Real World utf8
data (in particular, samples of Real World data which can't be
downgraded to some singlebyte encoding).
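
Measuring it is cheap, at least.  A sketch of the sort of thing one
could run over sample data (utf8_char_count and utf32_size_bytes are
hypothetical names, and well-formed UTF-8 is assumed): count the
characters, multiply by four, and compare with the byte length of the
UTF-8 original.

```c
#include <assert.h>
#include <stddef.h>

/* Count characters in a UTF-8 buffer by counting every byte that is
   not a continuation byte (10xxxxxx).  Assumes well-formed UTF-8. */
static size_t utf8_char_count(const unsigned char *buf, size_t len)
{
    size_t i, n = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        if ((buf[i] & 0xC0) != 0x80)
            n++;
    return n;
}

/* Bytes the same text would occupy as UTF-32: four per character. */
static size_t utf32_size_bytes(const unsigned char *buf, size_t len)
{
    return 4 * utf8_char_count(buf, len);
}
```

The bounds are easy to state: text made up entirely of three-byte
sequences grows by a third when converted (3 bytes to 4 per character),
while pure ASCII grows fourfold.  Where Real World data falls between
those is the part that needs measuring.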

> Particularly when the regexp engine is written assuming O(1) random
> access.

It doesn't *need* to assume O(1) random access; after all, it's never
accessing *randomly*, it's always accessing just one character away from
some other character that it's recently accessed.  Sounds like a job for
an iterator to me.  With an iterator, it need only assume that advancing
the iterator by a distance of 1 takes O(1) time.
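
For UTF-8 that assumption holds in both directions, which matters for
an engine that backtracks.  A minimal sketch, assuming well-formed
UTF-8 and using byte offsets as the iterator (utf8_next_pos and
utf8_prev_pos are names invented for this example, not anything in the
tree):

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of the character after the one at pos.
   Skips at most three continuation bytes, so it is O(1). */
static size_t utf8_next_pos(const unsigned char *buf, size_t len, size_t pos)
{
    assert(pos < len);
    pos++;
    while (pos < len && (buf[pos] & 0xC0) == 0x80)
        pos++;
    return pos;
}

/* Byte offset of the character before the one at pos.
   Backs up over at most three continuation bytes, so it is O(1). */
static size_t utf8_prev_pos(const unsigned char *buf, size_t pos)
{
    assert(pos > 0);
    pos--;
    while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

Neither function ever looks at the start of the string or needs a
character index; all it needs is the position it was handed.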

> He thinks perl 5 would probably have been faster if it used
> UCS32 internally. Maybe ponie will.
> 
> Nicholas Clark

-- 
$a=24;split//,240513;s/\B/ => /for@@=qw(ac ab bc ba cb ca
);{push(@b,$a),($a-=6)^=1 for 2..$a/6x--$|;print "[EMAIL PROTECTED]
]\n";((6<=($a-=6))?$a+=$_[$a%6]-$a%6:($a=pop @b))&&redo;}
