Nicholas Clark wrote:
>
> On Wed, Aug 20, 2003 at 07:19:42PM -0400, Benjamin Goldberg wrote:
> > > Leopold Toetsch wrote:
> > > > But these could be converted to utf32 as soon as they are seen.
> >
> > For a long string, that could be quite a bit of bloat.
>
> Jarkko's view is that the combined hit of the size of the extra code to
> skip along the variable length encoding,

We've already got code to skip along a variable length encoding
(skip_forward does precisely this).

> the time taken to execute that code,

With the current code, where string_ functions take indices, not
iterators, this skipping forward already needs to be done. Except that
it's done inside each and every string_ function, instead of in a
separate string_convert_index_to_iterator function.

> (and I guess the cache misses it creates)

If we're converting indices to iterators often, then the converter will
remain in cache. If we're smart enough to convert once, and then do
everything relative to that iterator, then the cost of the cache miss to
load the convert function into memory will be relatively minor.

> is greater than the gain from saving space.

How much gain there is in space by keeping data in utf8, I don't know.
This would have to be determined by examining samples of Real World utf8
data (in particular, samples of Real World data which can't be
downgraded to some single-byte encoding).

> Particularly when the regexp engine is written assuming O(1) random
> access.

It doesn't *need* to assume O(1) random access; after all, it's never
accessing *randomly*, it's always accessing just one character away from
some other character that it's recently accessed. Sounds like a job for
an iterator to me. With an iterator, it need only assume that advancing
the iterator a distance of 1 takes O(1) time.

> He thinks perl 5 would probably have been faster if it used
> UCS32 internally. Maybe ponie will.
>
> Nicholas Clark

-- 
$a=24;split//,240513;s/\B/ => /for@@=qw(ac ab bc ba cb ca );{push(@b,$a),($a-=6)^=1 for 2..$a/6x--$|;print "[EMAIL PROTECTED] ]\n";((6<=($a-=6))?$a+=$_[$a%6]-$a%6:($a=pop @b))&&redo;}
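
To make the iterator argument concrete, here is a minimal sketch in C of
what such an iterator over utf8 data might look like. It assumes the
input is valid UTF-8. The names utf8_iter, utf8_iter_next,
utf8_iter_prev, utf8_iter_get and utf8_index_to_iter are hypothetical,
made up for illustration; they are not Parrot's actual API, and
utf8_index_to_iter only stands in for the
string_convert_index_to_iterator idea. Converting an index costs O(n)
once; every step after that touches at most 4 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical iterator: a byte position plus the character index
 * that position corresponds to. Not part of any real Parrot API. */
typedef struct {
    const unsigned char *ptr;   /* first byte of the current character */
    size_t               index; /* character (not byte) index */
} utf8_iter;

/* Byte length of the sequence whose lead byte is b.
 * Assumes b really is a lead byte of valid UTF-8. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Advance by one character: O(1), reads a single lead byte. */
static void utf8_iter_next(utf8_iter *it)
{
    it->ptr += utf8_seq_len(*it->ptr);
    it->index++;
}

/* Step back by one character: O(1), skips at most 3 continuation
 * bytes (those of the form 10xxxxxx). */
static void utf8_iter_prev(utf8_iter *it)
{
    assert(it->index > 0);
    do {
        it->ptr--;
    } while ((*it->ptr & 0xC0) == 0x80);
    it->index--;
}

/* Decode the code point at the iterator's position. */
static unsigned long utf8_iter_get(const utf8_iter *it)
{
    const unsigned char *p = it->ptr;
    size_t len = utf8_seq_len(*p);
    unsigned long cp;
    size_t i;

    if (len == 1)
        return *p;
    cp = *p & (0xFFu >> (len + 1));        /* payload bits of lead byte */
    for (i = 1; i < len; i++)
        cp = (cp << 6) | (p[i] & 0x3Fu);   /* 6 bits per continuation */
    return cp;
}

/* The one O(n) operation: walk from the start to character `index`.
 * The caller must ensure the string has at least `index` characters. */
static utf8_iter utf8_index_to_iter(const char *s, size_t index)
{
    utf8_iter it;
    it.ptr = (const unsigned char *)s;
    it.index = 0;
    while (it.index < index)
        utf8_iter_next(&it);
    return it;
}
```

A matcher built this way would call utf8_index_to_iter once for its
starting offset and then use only utf8_iter_next, utf8_iter_prev and
utf8_iter_get, so it never needs O(1) random access to the string.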