Re: regular expressions was: Kernel Oops

mouss Tue, 08 Mar 2011 15:03:48 -0800

[WARNING: Steven CC'd]

Le 08/03/2011 21:29, Stan Hoeppner a écrit :
> Wietse Venema put forth on 3/8/2011 10:39 AM:
>> Stan Hoeppner:
>>> So, the question is, which form of expression processes the "does not
>>> match" case faster?  The fully qualified expression, or the simple
>>> expression?  Noel mentioned that the fully qualified expressions will
>>> tend to process faster.  Is this true?  Is it true for both the
>>> "matches" and "does not match" case?
>>
>> I would expect better performance when patterns only match the text
>> that needs to be matched.
>


to get better performance, one would use patterns that fail to match as
soon as possible. I mean if you have /^a..../, then the check would stop
as soon as the first char isn't an "a". but the expressions we would
like to match and the expressions we see are completely different
things. so I'd say, do not consider performances as a primary target. go
for catching spammers first. only tune after you get the irght rules,
and only if needed (I personally don't tune anything here. I'm happy to
focus on catching spammers).

> So this would mean the simpler expressions would be faster?

No. /^a(complex blah)/ is faster than /joe/ because the first will stop
if the first char sin't "a" whatever is the rest of the expression.

>  That makes
> me wonder why Enemies List[1] uses complex expressions, each one
> precisely matching a specific rDNS pattern, given EL matches 65k+
> patterns total. 

as said above, the goal isn't performance (to improve performance, buy
better hardware or run multiple instances). The goal of Steven is to
maximize hit rate while minimizing false positives. many of us have
created rules to block generic/dynamic/silly senders. when doing so, you
can start by being precise at the risk of doing a lot of work because
your rules minimise FPs, or going the other side by using expressions
that block a lot of senders inclusing legitimate ones, that is
increasing the FP rate. it takes time and efforts to get a good balance,
and that's what Steven work is about.

>[snip]
> 
>> If you must match a very large numbers of patterns, you need an
>> implementation that transforms N patterns into one deterministic
>> automaton. This can match 1 pattern in the same time as N patterns.
>> Once the automaton is built (which takes some time) it is blindingly
>> fast. An example of such an implementation is flex.
> 
> This sounds really interesting.  Do you have a link to info about this
> flex software?  I'd like to read about it.
> 

[note: it wasn't me who said the text above. I however studied the
problem, in a completely different context. I can tell you one thing:
forget about optimizing your pcre rules. optimisation is useful in DNA
matching problems and the like. and even then...).

>> Similar optimizations are needed for large CIDR maps. Right now,
>> Postfix's linear search does 10^8 patterns/s. With this, postscreen
>> can search the largest ipdeny.com file in 1ms on a modern CPU,
>> which is sufficient for the moment. To make it fast, the CIDR
>> entries need to be arranged into a tree that can be traversed in
>> log(N) time.
> 
> I recall you and Viktor discussing this a while ago.  I don't really
> understand how an OP (myself) would go about creating a tree of our CIDR
> tables.  Or is this something that the Postfix CIDR code would handle?
> 

if cidr is to be enhanced, then it would be done inside cidr
implementation. the problem is the usual one: algorithms are often said
to be k*O(f(n)). so you generally prefer f(n)=log(n) over f(n)=n^2. but
this is only good for large n, and n is never large, so you need to
remember about the k constant. said otherwise: k1 * n^2 < k2 < log(n)
for small n under some conditions.

> 
> [1] Enemies List is not available for Postfix, yet, and the intelligence
> dataset is not free, although the source code is open.  EL is integrated
> in some commercial AS appliances and commercial mail software.  I
> mention it frequently here because it is the only antispam tool I'm
> aware of that makes almost exclusive use of regexes to identify likely
> spam sources, and it uses 10s of thousands of regexes.
> 

I don't use EL, but I think it is usable with postfix. Steven, can you
confirm this? (some of the features may be sendmail oriented, but it
would be easy to "generalize" them).

Re: regular expressions was: Kernel Oops

Reply via email to