On Thu, 04 Dec 2003 16:21:14 -0800, Greg Webster <[EMAIL PROTECTED]> writes:
> Excellent. I am in agreement.
>
> I've sent a raw list of all the urls in the rules to Chris Santerre with
> a promise that once I find some time I'll write up some perl code to
> clean up and form rules out of them.
>
> Anyone have any resource-optimization documentation for regexps in Perl?

Regexps are the wrong hammer here. The right tool is Aho-Corasick: it can match an arbitrary number of strings in a single linear pass over the input.

Generally, perl behaves best when it knows a fixed prefix that must occur in the strings, i.e. h(foo|bar) is always better than hfoo|hbar, because the regexp engine can check whether the 'h' matches and reject immediately if it doesn't. This is perhaps the most important optimization, since it avoids entering the regexp engine entirely at most offsets, so this factoring should *always* be done.

Also, in a disjunction like (foo|bar|baz|bang), the engine must check each case individually --- all four. But with foo|b(ar|az|ang), it checks four cases only when the input starts with a 'b', and just two for any other letter. There are small second-order costs from the extra disjunction nesting, because perl then can't use its optimized strcmp() loop and must re-enter the regexp engine. It may be that the factoring only pays off when, say, N or more rules share a common prefix; experimentation would be needed to find the right threshold for N. My guess is somewhere between 5 and 50.

Scott
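To make the Aho-Corasick suggestion concrete, here is a minimal illustrative sketch in Python (not SpamAssassin code, and function names are my own): it builds a trie over the patterns, wires up failure links with a BFS, and then finds every pattern occurrence in one linear pass over the text.

```python
# Minimal Aho-Corasick sketch: all patterns matched in one pass.
from collections import deque

def build_automaton(patterns):
    # Parallel arrays: goto[node] maps char -> next node,
    # fail[node] is the failure link, out[node] lists patterns ending here.
    goto = [{}]
    fail = [0]
    out = [[]]

    # Phase 1: build the trie.
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)

    # Phase 2: BFS to set failure links (root's children fail to root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit outputs reachable via the failure link.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def search(text, automaton):
    """Return (start_offset, pattern) for every match, in one linear pass."""
    goto, fail, out = automaton
    node = 0
    hits = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

For example, `search("ushers", build_automaton(["he", "she", "his", "hers"]))` reports "she", "he", and "hers" from one scan, which is the behavior a big alternation of URL strings would otherwise need the regexp engine (and backtracking) for.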
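The common-prefix factoring can also be mechanized when generating rules from a raw string list. Below is a hedged Python sketch (my own helper, one level of factoring by first character only, not the SpamAssassin rule generator) that turns a word list into an alternation of the h(foo|bar) form rather than hfoo|hbar:

```python
# Build a one-level prefix-factored alternation from a list of literals.
import re
from collections import defaultdict

def factored_alternation(words):
    # Group literals by their first character (assumes non-empty words).
    groups = defaultdict(list)
    for w in sorted(set(words)):
        groups[w[0]].append(w[1:])

    parts = []
    for head, tails in sorted(groups.items()):
        if len(tails) == 1:
            # Lone word: no factoring needed.
            parts.append(re.escape(head + tails[0]))
        else:
            # Shared first char hoisted out of a non-capturing group.
            parts.append(re.escape(head) + "(?:"
                         + "|".join(map(re.escape, tails)) + ")")
    return "|".join(parts)
```

So `factored_alternation(["hfoo", "hbar"])` yields `h(?:bar|foo)`, letting the engine reject on the first character. A real generator would recurse to factor longer shared prefixes, and (per the threshold question above) might only factor groups with N or more members.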