On Mon, 8 Dec 2003 18:25:49 -0600 (CST), David B Funk <[EMAIL PROTECTED]> writes:

> > > ie: V...i..a.gr..a
> > >
> > > As I suggested in my email, there's lots of combinations that spammers
> > > can do to avoid the original rule. There's also lots of ways to
> > > construct the rule to get a broader hit-base, at the expense of
> > > greater processing time.
> >
> > In theory, this isn't that much additional matching time, especially
> > with an automata. In practice though, these sorts of rules will kill
> > performance because Perl cannot apply the literal optimization,
> > especially if they're applied widely. (There's more than just Vxxxxx
> > -- most of the phrase rules need this sort of treatment.)
> >
> > Scott
> 
> Scott,


> If it's a bounded wild-card (".{0,6}") as opposed to unbounded (".*")
> is it less of a hit? (IE reasonable thing to do).

For a DFA matcher like the one I'm writing, a bounded wild-card is far
more expensive.

For perl, it will make the worst case (or perhaps it should be called
the malicious case) faster, but is unlikely to make the typical case
faster.

Generally, perl will take at least linear time when the following
conditions are met:

   #2: For every repetition r* in a pattern of the form r*s, the
   prefix of s should be disjoint from r.

       IE     a*bad    GOOD
              a*arghh  BAD
              [abc]*[def]  GOOD
              [abc]*[cde]  BAD  (c is in both sets)
              [Vv][ _.,]*[Ii][ _.,]*[Gg]  GOOD
              
   #3: The above may be nested within disjunction or repetition.
              (a*b|c*d|b*x)  GOOD

(At *, read that to be any sort of bounded or unbounded repetition.)
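Rule #2 can be checked mechanically for simple character classes. Here
is a minimal sketch in Python (my own illustration; the rule itself is
engine-independent): the set of characters being repeated and the set
that follows must not intersect.

```python
def disjoint(repeated: str, following: str) -> bool:
    """Rule #2 check for character classes: the class repeated under
    '*' must share no characters with the class that follows it."""
    return not (set(repeated) & set(following))

# The examples from the rule above:
assert disjoint("a", "b")          # a*bad                       GOOD
assert not disjoint("a", "a")      # a*arghh                     BAD
assert disjoint("abc", "def")      # [abc]*[def]                 GOOD
assert not disjoint("abc", "cde")  # [abc]*[cde]  ('c' in both)  BAD
assert disjoint(" _.,", "Ii")      # [Vv][ _.,]*[Ii]...          GOOD
```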

As a set of more general advice, Perl runs a nondeterministic
backtracking matcher. IE, it tries to match *each* of the alternatives
in the RE against the string, and when one fails it backtracks to a
prior choice and tries its alternatives. Only if every alternative
fails will it give up and report a match failure. This matching
happens left-to-right, so the sooner it can detect a failure, the
faster it goes.

For instance, ab|ac|ad|ae is a little slower than a(b|c|d|e) because
in the first case the engine must attempt to match the input against
each alternative one at a time; in the second case it can bail out
early if the input doesn't start with an 'a'.
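The two forms match exactly the same strings; the win is purely in how
quickly the engine can reject. A quick sanity check (using Python's re
module, which is also a backtracking matcher -- my substitution, not
something from this thread):

```python
import re

# Same language, different factoring: the second pattern shares the
# leading 'a', so a non-'a' input is rejected after one comparison
# instead of four attempted alternatives.
slow = re.compile(r"\A(?:ab|ac|ad|ae)\Z")
fast = re.compile(r"\Aa(?:b|c|d|e)\Z")

for s in ["ab", "ac", "ad", "ae", "af", "xb", "a", ""]:
    assert bool(slow.match(s)) == bool(fast.match(s))
```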

Rule #2 is perhaps the most important rule for REs. Whenever it is
followed, the RE engine provably can bail out quickly when trying
alternatives. When it is violated, the RE engine may in some cases
face an extreme amount of nondeterminism (IE, lots of alternatives).
I've looked over SA's regexps for a worst-case matching analysis, and
the only sticking point was:

    http://article.gmane.org/gmane.mail.spam.spamassassin.general/25773/match=regular

    2   '[\-.0-9=A-Z_a-z]+ <at> \w+(?:[\-.0-9A-Z_a-z]+\.)+\w+'

    Feed it a bunch of dots followed by a non-word character...

    Say... 'f <at> blah.................................................#'

and a few variants.
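The reason the dots hurt is the nested repetition: the group
(?:[\-.0-9A-Z_a-z]+\.)+ can carve a run of n dots into chunks in
2^(n-1) distinct ways, and on overall failure a backtracking engine
may explore all of them. A small count of that blow-up (my sketch, not
from the thread):

```python
# Number of ways to split a run of n characters into an ordered
# sequence of non-empty chunks (compositions of n): 2**(n-1). Each
# split is one more path a backtracking engine can try before it
# admits failure.
def compositions(n: int) -> int:
    return 2 ** (n - 1) if n > 0 else 1

assert compositions(1) == 1
assert compositions(10) == 512
# 50 dots, as in the sample attack string, allows 2**49 splits --
# roughly 5.6e14 paths.
assert compositions(50) == 2 ** 49
```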

> Are there any reasonably simple ways to do this with out killing things?

If you follow rule #2 above, things will be good.

> (EG .? == OK, .* == BAD, .{0,n} == acceptable, for small values of 'n')

.* may be bad because, unless it is followed by a newline (as in
.*\n), there is always nondeterminism in trying to match it.

abc.*def.*ghi is bad from a worst-case perspective: with an input
along the lines of:

  abcdef gh abcdef gh abcdef gh abcdef gh abcdef gh abcdef gh

repeated for about 4kb worth, it takes several seconds to fail to
match.

Observe:

perl -Mre=debug -e \
  '"abcdef gh abcdef gh abcdef gh abcdef gh abcdef gh abcdef gh" =~ /abc.*def.*ghi/' \
  2>&1 | less

Bounding the quantification with .{,10} is better than nothing, but
the worst case still exists. Fortunately, for many of SA's patterns,
Perl's literal substring optimization:

   anchored `abc' at 0 floating `def' at 3..2147483647 (checking floating) minlen 9

makes most of the SA patterns impossible to attack. That's luck, not
design.
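That optimization amounts to cheap substring pre-checks run before the
engine proper. A hand-rolled sketch of the same idea in Python (the
function name and structure are mine, not perl's internals):

```python
import re

pat = re.compile(r"abc.*def.*ghi")

def search_with_prefilter(s: str):
    # Mimic perl's literal-string optimization by hand: any match must
    # contain the literals 'abc' and 'def', so two cheap substring
    # checks can veto most inputs without invoking the engine at all.
    if "abc" not in s or "def" not in s:
        return None
    return pat.search(s)

assert search_with_prefilter("xx abc..def..ghi") is not None
assert search_with_prefilter("no literals at all") is None
```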

> Are there any studies of the Perl matching engine for efficiency
> and rules-of-thumb?

Mine?  The other stuff out there is regrettably out of date. The
nightmare of perl regexps that take years to match was ended by regexp
engine changes in perl a few years ago.

See my comments in the thread:

     Simplifying BigEvilList rules

from 4 days ago. There I explain a few optimizations that occur
outside of the regexp engine. SA doesn't, as a rule, really exercise
the regexp engine with hideous patterns, so those rules are generally
more relevant.

For an example of the optimization it uses, see the line:

    anchored `abc' at 0 floating `def' at 3..2147483647 (checking floating) minlen 9

from the perl line above. This means that the regexp optimizer
determined that any match for the pattern must both start with 'abc'
and contain a 'def'. As an optimization, the regexp engine isn't
invoked unless the string at the current position meets those
requirements.

Summarizing, the most important rule is to AVOID NONDETERMINISM IN
YOUR PATTERN. A second rule is to start your pattern with a fixed
string.

This is a bit long, but I hope it helps.

Also, this is perl-specific. With a different engine, like a DFA one,
all bets are off and everything you know is wrong. :)

Scott


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
