On Saturday 05 January 2008 01:26:48 Patrick R. Michaud wrote:

> As of r24557 I've rewritten find_cclass and find_not_cclass
> so that they use a string iterator instead of repeated calls
> to ENCODING_GET_CODEPOINT.  I also improved utf8_set_position
> a bit so that it doesn't always have to restart position
> counting from the beginning of the string.  As a result,
> compiling the actions.pl script on my machine goes from 39s to
> a little over 28s -- about a 25% speed increase.
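
For anyone following along, here is a minimal sketch of the position-caching idea described in the quoted text: remember the last (character index, byte offset) pair and scan forward from there instead of from the start of the string. This is not Parrot's C code. The class Utf8Cursor, its fields, and the steps counter are hypothetical names made up for illustration, and the sketch assumes valid UTF-8 input and an in-range target.

```python
class Utf8Cursor:
    """Tracks a cached (character index, byte offset) pair over UTF-8 bytes.

    Hypothetical illustration only; assumes `data` is valid UTF-8.
    """

    def __init__(self, data: bytes):
        self.data = data
        self.charpos = 0   # character index of the cached position
        self.bytepos = 0   # byte offset that corresponds to charpos
        self.steps = 0     # characters skipped so far (for illustration)

    def _skip_forward(self, bytepos: int, count: int) -> int:
        """Advance bytepos by `count` characters using each lead byte."""
        for _ in range(count):
            lead = self.data[bytepos]
            if lead < 0x80:      # 1-byte sequence (ASCII)
                bytepos += 1
            elif lead < 0xE0:    # 2-byte sequence
                bytepos += 2
            elif lead < 0xF0:    # 3-byte sequence
                bytepos += 3
            else:                # 4-byte sequence
                bytepos += 4
            self.steps += 1
        return bytepos

    def set_position(self, target: int) -> int:
        """Move to character index `target` and return its byte offset."""
        if target < self.charpos:
            # Moving backwards: this sketch simply rescans from the start.
            self.charpos = 0
            self.bytepos = 0
        # Moving forwards: continue from the cached position.
        self.bytepos = self._skip_forward(self.bytepos, target - self.charpos)
        self.charpos = target
        return self.bytepos


cursor = Utf8Cursor("naïve café".encode("utf-8"))
cursor.set_position(3)   # scans 3 characters
cursor.set_position(8)   # scans only 5 more, not 8 from the beginning
```

Forward seeks cost only the distance from the cached position, which is the common case when a matcher walks a string left to right. Backward seeks still pay for a full rescan in this sketch.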

Profile-wise, these two changes represent a 21.6% increase.  That's not bad at all!  utf8_skip_forward() is about seven times faster now, while utf8_set_position() is about 1.5 times faster.

> It's likely that even with these improvements we still do a
> fair bit of position counting.  For example, utf8_skip_forward
> and ENCODING_GET_CODEPOINT are probably still called a fair bit --
> if nothing else, the is_cclass opcode uses them, as do other
> operations.  Some sort of memoization for utf8_skip_forward might
> give us even more speedups, but the amount of improvement would
> really depend on how/when these are being called.
>
> I think it will still be worthwhile to investigate
> converting strings into a fixed-width encoding of some sort
> instead of performing scans on variable-width encodings.

Agreed... if we figure out our Unicode strategy.

-- c