On Tue, 2010-08-10 at 11:19 +0300, Henrik K wrote:
> Runtime for different methods (memory used including Perl itself):
> 
> - Single 70000 name regex, 20s (8MB)
> - 7 regexes of 10000 names each, 141s (9MB)
> - "Martin style", lookups from Perl hash, 8s (12MB)
> 
Very interesting indeed. Thanks for trying it. I'm not surprised that
the set of 7 regexes took longer than the one big one, but I am
surprised that the time ratio (141s against 20s) is so close to a
factor of 7.

Out of interest, did you leave the headers in your test messages? I did
initially when I developed the generic name matches, but then removed
them because most of the hits were in headers while the real-life
scan-and-compare rule would only be applied to the body. 

Of course, there should be almost no difference if there is no match in
a message, but on average we can guess that the single regex will do
35,000 attempted matches for every candidate name pair that generates a
hit, while the set of seven will do 65,000 attempted matches (6 x 10,000
for the six regexes that don't contain the match and 5,000 on average
for the one that does).
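
As a rough check of that arithmetic, here is a small Python sketch. It
assumes a deliberately simple model that is mine, not a description of
how the regex engine really works: each alternative counts as one
attempted match, the matching name is equally likely to sit anywhere in
the list, and all seven regexes are always run. The function and
variable names are made up for illustration.

```python
# Simplified model (an assumption, not how Perl's regex engine is
# guaranteed to behave): one attempt per alternative, the hit is equally
# likely anywhere in the list, and every split regex is always run.
TOTAL_NAMES = 70000
CHUNKS = 7

def expected_attempts_single(total):
    # On average the hit is found halfway through the alternation.
    return total / 2

def expected_attempts_split(total, chunks):
    per_chunk = total / chunks
    # (chunks - 1) regexes are exhausted without a hit; the one holding
    # the name stops halfway through on average.
    return (chunks - 1) * per_chunk + per_chunk / 2

single = expected_attempts_single(TOTAL_NAMES)
split = expected_attempts_split(TOTAL_NAMES, CHUNKS)
print(single, split, split / single)
```

Under this model the split set does a bit less than twice the work of
the single regex, so the guess alone does not account for the measured
factor of about 7.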

One thing this experiment makes clear is that a rule containing a lot of
alternates, such as one scanning the body for misspelt words, will
perform better if it contains one long regex rather than a set of
shorter regexes plus an OR meta to combine them. The latter is easier
to maintain but runs more slowly.

In the past I used the second form but now I always use a single long
regex that is built from a rule definition file with my 'portmanteau'
script - its rule definition file is easy to maintain because it holds
each alternate pattern on a separate line.
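
For anyone who wants to try the same approach, here is a minimal Python
sketch of the idea. It is not the 'portmanteau' script itself. The
function name build_rule, the rule name MISSPELT_WORDS, the sample
patterns, the '#' comment convention and the \b...\b wrapping with /i
are all assumptions made for illustration. It also assumes that no
pattern contains an unescaped '/'.

```python
def build_rule(name, lines):
    """Join one-pattern-per-line definitions into a single body rule."""
    alternates = []
    for line in lines:
        pattern = line.strip()
        # Assumed convention: blank lines and '#' comments are ignored.
        if not pattern or pattern.startswith("#"):
            continue
        alternates.append(pattern)
    if not alternates:
        raise ValueError("no patterns found")
    # One non-capturing group holding every alternate, so the whole
    # list is tested by a single regex.
    joined = "|".join(alternates)
    return "body %s /\\b(?:%s)\\b/i" % (name, joined)

# Hypothetical rule definition file contents, one alternate per line.
definition = """\
# misspelt words, one alternate per line
recieve
seperate
definat(?:e|ly)
"""

rule = build_rule("MISSPELT_WORDS", definition.splitlines())
print(rule)
```

The definition text stays easy to edit because each alternate lives on
its own line, while the generated rule is a single regex.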
 

Martin

