On Thu, 04 Dec 2003 16:21:14 -0800, Greg Webster <[EMAIL PROTECTED]> writes:

> Excellent. I am in agreement.
> 
> I've sent a raw list of all the URLs in the rules to Chris Santerre with
> a promise that once I find some time I'll write up some Perl code to
> clean them up and form rules out of them.
> 
> Anyone have any resource-optimization documentation for regexps in Perl?
> 

Regexps are the wrong hammer here. The right tool is Aho-Corasick, which
can match an arbitrary number of fixed strings in a single linear pass
over the input.
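
Here's a minimal hand-rolled sketch of the idea in Perl. The sub names,
data layout, and sample patterns are just made up for illustration, and
it's completely untuned:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Build an Aho-Corasick automaton: a trie over the patterns plus
  # "failure" links so matching never has to back up in the input.
  sub build_ac {
      my @pats  = @_;
      my @nodes = ( { next => {}, fail => 0, out => [] } );  # node 0 = root

      # 1. Insert every pattern into the trie.
      for my $p (@pats) {
          my $s = 0;
          for my $ch (split //, $p) {
              unless (exists $nodes[$s]{next}{$ch}) {
                  push @nodes, { next => {}, fail => 0, out => [] };
                  $nodes[$s]{next}{$ch} = $#nodes;
              }
              $s = $nodes[$s]{next}{$ch};
          }
          push @{ $nodes[$s]{out} }, $p;
      }

      # 2. Breadth-first pass to fill in the failure links.
      my @queue = values %{ $nodes[0]{next} };
      while (@queue) {
          my $r = shift @queue;
          for my $ch (keys %{ $nodes[$r]{next} }) {
              my $s = $nodes[$r]{next}{$ch};
              push @queue, $s;
              my $f = $nodes[$r]{fail};
              $f = $nodes[$f]{fail}
                  while $f != 0 && !exists $nodes[$f]{next}{$ch};
              $nodes[$s]{fail} =
                  exists $nodes[$f]{next}{$ch} ? $nodes[$f]{next}{$ch} : 0;
              # A match ending here also ends every pattern that ends at
              # the failure state.
              push @{ $nodes[$s]{out} }, @{ $nodes[ $nodes[$s]{fail} ]{out} };
          }
      }
      return \@nodes;
  }

  # One linear pass over the text; every pattern hit is reported.
  sub ac_match {
      my ($nodes, $text) = @_;
      my ($s, @hits) = (0);
      for my $i (0 .. length($text) - 1) {
          my $ch = substr($text, $i, 1);
          $s = $nodes->[$s]{fail}
              while $s != 0 && !exists $nodes->[$s]{next}{$ch};
          $s = exists $nodes->[$s]{next}{$ch} ? $nodes->[$s]{next}{$ch} : 0;
          push @hits, @{ $nodes->[$s]{out} };
      }
      return @hits;
  }

  my $ac = build_ac(qw(viagra mortgage hgh));
  print "hit: $_\n" for ac_match($ac, "cheap viagra and hgh here");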

Generally, Perl performs best when the pattern has a fixed prefix that it
knows must occur in any match, i.e.:

  h(foo|bar) is always better than hfoo|hbar, because the regexp engine
  can check whether the 'h' matches and reject immediately if it
  doesn't. This is perhaps the most important optimization, since it
  avoids entering the full regexp engine at most offsets, so this
  factoring should *always* be done.
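
A quick way to see the effect for yourself is the core Benchmark module.
The patterns and test string below are made up, and the exact numbers
will vary a lot between perl versions:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Benchmark qw(cmpthese);

  # A longish line that matches neither pattern, so every offset has to
  # be rejected.
  my $line = ('x' x 200) . ' nothing interesting here ' . ('y' x 200);

  my $unfactored = qr/hfoo|hbar/;     # two independent alternatives
  my $factored   = qr/h(?:foo|bar)/;  # shared fixed prefix 'h'

  cmpthese(-1, {
      unfactored => sub { $line =~ $unfactored },
      factored   => sub { $line =~ $factored   },
  });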

Similarly, in a flat disjunction like foo|bar|baz|bang the engine must
try each of the four cases individually. With foo|b(ar|az|ang) it only
tries four cases when the input starts with 'b', and two for any other
letter. There are small second-order costs to the extra disjunction
nesting, because Perl won't use its optimized strcmp() loop and must
re-enter the regexp engine for the nested group. So the factoring may
only pay off once there are, say, N or more rules sharing a common
prefix; some experimentation would be needed to find the right threshold
for N. My guess is somewhere between 5 and 50.
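
As a rough starting point for the rule-generation script mentioned
above, something like this could group a string list on a shared first
character and emit the factored alternation. The sub name and the
$min_group threshold (the N above) are illustrative assumptions:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Group fixed strings on their first character and emit a factored
  # alternation such as b(?:ar|az|ang)|foo.  $min_group is the threshold
  # N discussed above: smaller groups are left unfactored.
  sub factor_alternation {
      my ($min_group, @strings) = @_;
      my %by_first;
      push @{ $by_first{ substr($_, 0, 1) } }, $_ for @strings;

      my @parts;
      for my $ch (sort keys %by_first) {
          my @group = map { quotemeta } @{ $by_first{$ch} };
          if (@group >= $min_group) {
              # Hoist the shared first character out of the group.
              push @parts, quotemeta($ch) . '(?:'
                  . join('|', map { substr($_, length(quotemeta($ch))) } @group)
                  . ')';
          }
          else {
              push @parts, @group;
          }
      }
      return join '|', @parts;
  }

  print factor_alternation(3, qw(foo bar baz bang quux)), "\n";
  # prints:  b(?:ar|az|ang)|foo|quux

This only factors on the first character; a fuller version would
presumably recurse on longer common prefixes before folding the result
into qr// rules.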

Scott

