On Mon, 8 Dec 2003 18:25:49 -0600 (CST), David B Funk <[EMAIL PROTECTED]> writes:
> > > ie: V...i..a.gr..a
> > >
> > > As I suggested in my email, there's lots of combinations that spammers
> > > can do to avoid the original rule. There's also lots of ways to
> > > construct the rule to get a broader hit-base, at the expense of
> > > greater processing time.
> >
> > In theory, this isn't that much additional matching time, especially
> > with an automaton. In practice though, these sorts of rules will kill
> > performance because Perl cannot apply the literal optimization,
> > especially if they're applied widely. (There's more than just Vxxxxx
> > -- most of the phrase rules need this sort of treatment.)
> >
> > Scott
>
> Scott,
> If it's a bounded wild-card (".{0,6}") as opposed to unbounded (".*"),
> is it less of a hit? (IE a reasonable thing to do).

For a DFA matcher like the one I'm writing, a bounded wild-card is far
more expensive. For perl, it will make the worst case (or perhaps it
should be called the malicious case) faster, but would be unlikely to
make the typical case faster.

Generally, perl will take at least linear time when the following
conditions are met:

#2: For all repetitions of an r, in r*s, s's prefix should be disjoint
from r. IE:

  a*bad                        GOOD
  a*arghh                      BAD
  [abc]*[def]                  GOOD
  [abc]*[cde]                  BAD   (c is in both sets)
  [Vv][ _.,]*[Ii][ _.,]*[Gg]   GOOD

#3: The above may be nested within disjunction or repetition.

  (a*b|c*d|b*x)                GOOD

(Wherever * appears above, read it as any sort of bounded or unbounded
repetition.)

As a set of more general advice: Perl runs a nondeterministic
backtracking matcher. IE, it tries to match *each* of the alternatives
in the RE against the string, and when one fails it backtracks to a
prior choice and tries its alternatives. Only if every alternative
fails will it give up and assert a match failure. This matching happens
left-to-right. Thus, the sooner it can detect a failure, the faster it
goes. For instance:

  ab|ac|ad|ae

is a little slower than

  a(b|c|d|e)

because in the first case it must attempt to match the input against
each alternative one at a time; in the second case it can bail out
early if the input doesn't match an 'a'.

Rule #2 is perhaps the most important rule for REs. Whenever it is
followed, the RE engine provably can bail out quickly when it is trying
alternatives. When it is violated, the RE engine may in some cases have
an extreme amount of nondeterminism (IE, lots of alternatives to try).

I've looked over SA's regexps for an analysis of worst-case regexp
matching, and the only sticking point was:

http://article.gmane.org/gmane.mail.spam.spamassassin.general/25773/match=regular

  2 '[\-.0-9=A-Z_a-z]+ <at> \w+(?:[\-.0-9A-Z_a-z]+\.)+\w+'

Feed it a bunch of dots followed by a non-word... Say:

  'f <at> blah.................................................#'

and a few variants.

> Are there any reasonably simple ways to do this without killing things?

If you follow rule #2 above, things will be good.

> (EG .? == OK, .* == BAD, .{0,n} == acceptable, for small values of 'n')

.* may be bad because unless it is followed by a newline (.*\n), there
is always nondeterminism in trying to match it.

  abc.*def.*ghi

is bad from a worst-case perspective, because an input along the lines
of:

  abcdef gh abcdef gh abcdef gh abcdef gh abcdef gh abcdef gh

(repeat for about 4kb worth) takes several seconds to fail to match.
Observe:

  perl -Mre=debug -e '"abcdef gh abcdef gh abcdef gh abcdef gh abcdef gh abcdef gh" =~ /abc.*def.*ghi/' 2>&1 | less

Bounding the quantification with .{0,10} is better than nothing, but
the worst case still exists.
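If you want numbers rather than debug spew, here's a rough timing
sketch of my own (throwaway code, not anything from SA; the ~4kb string
and the 0,10 bound are arbitrary choices for illustration) comparing
the unbounded wild-card against a bounded one on that same
fails-to-match input:

  #!/usr/bin/perl
  # Time how long each pattern takes to *fail* on ~4kb of near-misses.
  # Exact numbers will vary with your perl version and hardware.
  use strict;
  use warnings;
  use Time::HiRes qw(time);

  my $input = "abcdef gh " x 400;   # plenty of 'abc's and 'def's, never a 'ghi'

  for my $re (qr/abc.*def.*ghi/, qr/abc.{0,10}def.{0,10}ghi/) {
      my $t0      = time();
      my $matched = ($input =~ $re) ? "yes" : "no";
      printf "%-40s matched=%s  %.2f seconds\n", $re, $matched, time() - $t0;
  }

The unbounded version grinds through every def/ghi split for every
'abc' it finds; the bounded version only backtracks within its
10-character windows, so it fails quickly on this input -- though, as
noted above, a nastier input can still make the bounded form hurt.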
Fortunately, for many of SA's patterns, Perl's literal substring
optimization:

  anchored `abc' at 0 floating `def' at 3..2147483647 (checking floating) minlen 9

makes most of the SA patterns impossible to attack. That's luck, not
design.

> Are there any studies of the Perl matching engine for efficiency
> and rules-of-thumb?

Mine? The other stuff out there is regrettably out of date. The
nightmare of perl regexps that take years to match was ended by regexp
engine changes in perl a few years ago.

See my comments in the thread 'Simplifying BigEvilList rules' from 4
days ago. I explain a few optimizations that occur outside of the
regexp engine. SA doesn't, as a rule, really exercise the regexp engine
with hideous patterns, so those rules are generally more relevant.

For an example of the optimization it uses, see the line:

  anchored `abc' at 0 floating `def' at 3..2147483647 (checking floating) minlen 9

from the perl line above. This means that the regexp optimizer
determined that any match for the pattern must both start with 'abc'
and contain a 'def'. As an optimization, the regexp engine isn't
invoked unless the text at the current position meets those
requirements. (The P.S. below shows how to check this for your own
rules.)

Summarizing, the most important rule is to AVOID NONDETERMINISM IN YOUR
PATTERN. A second rule is to start your pattern with a fixed string.

This is a bit long, but I hope it helps. Also, this is perl specific.
With a different engine --- like a DFA one --- all bets are off and
everything you know is wrong. :)

Scott
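P.S. To see the literal-substring optimization (or the lack of one) for
a rule of your own, the same -Mre=debug trick shows it at compile time,
before any matching happens. These two patterns are throwaway examples
of mine, not actual SA rules, but they should illustrate the point:

  perl -Mre=debug -e '"x" =~ /nigerian bank transfer/' 2>&1 | grep -E 'anchored|floating|minlen'
  perl -Mre=debug -e '"x" =~ /[Vv][ _.,]*[Ii][ _.,]*[Aa][ _.,]*[Gg][ _.,]*[Rr][ _.,]*[Aa]/' 2>&1 | grep -E 'anchored|floating|minlen'

The first hands perl a long anchored literal, so most messages are
rejected by a plain substring search before the RE engine is ever
invoked. The second is nothing but character classes and repetition, so
there is no literal substring for the optimizer to extract -- which is
exactly why the obfuscated-Viagra style rules quoted at the top of this
mail hurt.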