On Tue, 2010-08-10 at 11:19 +0300, Henrik K wrote:
> Runtime for different methods (memory used including Perl itself):
>
> - Single 70000 name regex, 20s (8MB)
> - 7 regexes of 10000 names each, 141s (9MB)
> - "Martin style", lookups from Perl hash, 8s (12MB)

Very interesting indeed. Thanks for trying it. I'm not surprised that the set of 7 regexes took longer than the one big one, but I am surprised that the time difference is so close to the factor of 7.
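For reference, here is a minimal Perl sketch of what the three methods boil down to; the file name, chunk size and word splitting are illustrative guesses on my part, not Henrik's actual test harness:

#!/usr/bin/perl
# Sketch of the three approaches being compared (illustrative only).
use strict;
use warnings;

open my $fh, '<', 'names.txt' or die "names.txt: $!";   # one name per line
chomp(my @names = <$fh>);
close $fh;

# 1. One regex with all the names as alternates
my $all    = join '|', map { quotemeta } @names;
my $big_re = qr/\b(?:$all)\b/i;

# 2. Seven regexes of ~10000 alternates each
my @small_res;
my @rest = @names;
while (my @chunk = splice @rest, 0, 10_000) {
    my $alt = join '|', map { quotemeta } @chunk;
    push @small_res, qr/\b(?:$alt)\b/i;
}

# 3. Hash lookups: split the text into words and test each one
my %is_name = map { lc($_) => 1 } @names;

sub hit_single { my ($text) = @_; return $text =~ $big_re ? 1 : 0; }

sub hit_seven {
    my ($text) = @_;
    for my $re (@small_res) { return 1 if $text =~ $re }
    return 0;
}

sub hit_hash {
    my ($text) = @_;
    for my $word (split /\W+/, lc $text) { return 1 if $is_name{$word} }
    return 0;
}

The hash version turns each word in the message into a single lookup rather than a scan over alternates, which helps explain why it comes out fastest in the figures above.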
Out of interest, did you leave the headers in your test messages? I did initially when I developed the generic name matches, but then removed them because most of the hits were in the headers, while the real-life scan-and-compare rule would only be applied to the body.

Of course, there should be almost no difference if there is no match in a message, but on average we can guess that the single regex will make about 35,000 attempted matches for every candidate name pair that generates a hit, while the set of seven will make about 65,000 (6 x 10,000 for the six regexes that don't contain the match and 5,000 on average for the one that does).

One thing this experiment makes clear is that a rule containing a lot of alternates, such as one scanning the body for misspelt words, will perform better as one long regex than as a set of shorter regexes plus an OR meta rule to combine them - the latter is easier to maintain but slower to run. In the past I used the second form, but now I always use a single long regex built from a rule definition file by my 'portmanteau' script - the definition file is easy to maintain because it holds each alternate pattern on a separate line.

Martin
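For illustration, a minimal sketch of that kind of build step, assuming a plain one-pattern-per-line definition file and an invented rule name - this is not the actual 'portmanteau' script, whose format isn't shown here:

#!/usr/bin/perl
# Sketch: join one-pattern-per-line definitions into a single long
# alternation and emit one SpamAssassin body rule (illustrative only).
use strict;
use warnings;

my $deffile = shift or die "usage: $0 definition-file\n";
open my $fh, '<', $deffile or die "$deffile: $!";

my @alts;
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^\s*(?:#|$)/;   # skip comments and blank lines
    push @alts, $line;                # each line is one alternate pattern
}
close $fh;

# One long regex instead of several shorter rules plus an OR meta rule
my $portmanteau = join '|', @alts;
print "body  LOCAL_PORTMANTEAU  /\\b(?:$portmanteau)\\b/i\n";
print "score LOCAL_PORTMANTEAU  0.1\n";

Generating a single rule this way keeps the easy-to-maintain per-line source file while avoiding the overhead of evaluating several separate regexes plus a meta rule.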