On Saturday 05 January 2008 01:26:48 Patrick R. Michaud wrote:

> As of r24557 I've rewritten find_cclass and find_not_cclass
> so that they use a string iterator instead of repeated calls
> to ENCODING_GET_CODEPOINT.  I also improved utf8_set_position
> a bit so that it doesn't always have to restart position
> counting from the beginning of the string.  As a result,
> compiling the actions.pl script on my machine goes from 39s to
> a little over 28s -- about a 25% speed increase.

Profile-wise, these two changes represent a 21.6% increase.  That's not bad
at all!

utf8_skip_forward() is about seven times faster now, while utf8_set_position() 
is about 1.5 times faster.
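
For anyone following along, here is a rough sketch in C of why remembering the last position helps.  It is not Parrot's code: the utf8_iter struct and the seq_len and iter_set_position names are made up for illustration, and it assumes well-formed UTF-8.  The scan restarts from the beginning of the string only when the target lies behind the cached position.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical iterator (not Parrot's struct): caches the last
 * character index together with its byte offset. */
typedef struct {
    const unsigned char *buf;  /* UTF-8 bytes, assumed well-formed */
    size_t bytelen;            /* length of buf in bytes */
    size_t charpos;            /* cached character index */
    size_t bytepos;            /* byte offset of that character */
} utf8_iter;

/* Number of bytes in the sequence that starts with this lead byte. */
static size_t seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Move to character index `target`.  Scans forward from the cached
 * position when it can; rewinds to the start only when the target
 * is behind the cache. */
static void iter_set_position(utf8_iter *it, size_t target)
{
    assert(it != NULL);
    if (target < it->charpos) {
        it->charpos = 0;
        it->bytepos = 0;
    }
    while (it->charpos < target && it->bytepos < it->bytelen) {
        it->bytepos += seq_len(it->buf[it->bytepos]);
        it->charpos++;
    }
}
```

With mostly left-to-right access, as in a regex scan, each call only walks the characters between the old and new positions instead of the whole prefix.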

> It's likely that even with these improvements we still do a
> fair bit of position counting.  For example, utf8_skip_forward
> and ENCODING_GET_CODEPOINT are probably still called a fair bit --
> if nothing else, the is_cclass opcode uses them, as do other
> operations.  Some sort of memoization for utf8_skip_forward might
> give us even more speedups, but the amount of improvement would
> really depend on how/when these are being called.
>
> I think it will still be worthwhile to investigate
> converting strings into a fixed-width encoding of some sort
> instead of performing scans on variable-width encodings.

Agreed... if we figure out our Unicode strategy.
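
To make the fixed-width idea concrete, here is a minimal sketch in C that decodes UTF-8 into an array of 32-bit code points, after which indexing by character is a plain array lookup.  The utf8_to_utf32 name is hypothetical, not an existing Parrot function, and the sketch assumes well-formed input and does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode well-formed UTF-8 into fixed-width code points.
 * `out` must have room for at least `len` entries, since each
 * input byte yields at most one code point.
 * Returns the number of code points written. */
static size_t utf8_to_utf32(const unsigned char *in, size_t len,
                            uint32_t *out)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);
    while (i < len) {
        unsigned char b = in[i++];
        uint32_t cp;
        size_t extra;  /* continuation bytes still to read */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else                         { cp = b & 0x07; extra = 3; }

        /* Each continuation byte contributes its low six bits. */
        for (; extra > 0 && i < len; extra--)
            cp = (cp << 6) | (uint32_t)(in[i++] & 0x3F);

        out[n++] = cp;
    }
    return n;
}
```

The cost is one pass over the string up front and up to four bytes per character of storage, which is part of what a Unicode strategy would have to weigh.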

-- c
