On Thu, 2002-02-21 at 10:22, Arpi wrote:
> Hi,
> 
> > I ran my C version through your really big spam collection overnight and
> > filtered out the 'slow' messages. Then I checked which regexps make them so
> > slow ('slow' means 5..25 secs/mail on a P4 1.8 GHz).
> 
> more on this...
> 
> > FOR_INSTANT_ACCESS:
> > /(?:CLICK HERE|).{0,20}\s+INSTANT\s+ACCESS.{0,20}\s+(?:|CLICK HERE)/i
> > 
> > I think its author wanted to match "CLICK HERE * INSTANT ACCESS" and
> > "INSTANT ACCESS * CLICK HERE" but in a single common regexp.
> 
> anyway it's bad...
> it matches a lone " INSTANT ACCESS " without having "CLICK HERE" before or
> after it, so that part of the regexp is useless and just slows things down a
> lot. A faster alternative:
> 
> body FOR_INSTANT_ACCESS         /\sINSTANT\s+ACCESS.{0,20}\s+/i
> 
> correct me if i'm wrong, i'm still a newbie in the regexp world :)

I think
body FOR_INSTANT_ACCESS         /INSTANT ACCESS/i
is fine by itself.  I'll make that change now.
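
A small sketch of the point above, using Python's re module as a stand-in for the Perl/PCRE engines (an assumption; these constructs behave the same way there as far as I know). The names ORIGINAL, ARPI, SIMPLE and hits() are made up for illustration. It shows that the empty alternatives let the original rule fire with no "CLICK HERE" anywhere, and that the three variants disagree only on surrounding whitespace.

```python
import re

# The original rule and the two replacements discussed in the thread.
ORIGINAL = re.compile(
    r"(?:CLICK HERE|).{0,20}\s+INSTANT\s+ACCESS.{0,20}\s+(?:|CLICK HERE)",
    re.I,
)
ARPI = re.compile(r"\sINSTANT\s+ACCESS.{0,20}\s+", re.I)
SIMPLE = re.compile(r"INSTANT ACCESS", re.I)


def hits(text):
    """Return which of the three variants match the given text."""
    return {
        "original": bool(ORIGINAL.search(text)),
        "arpi": bool(ARPI.search(text)),
        "simple": bool(SIMPLE.search(text)),
    }


# No "CLICK HERE" in the text, yet every variant matches.
no_click = hits("Get INSTANT ACCESS to our site today")

# Without whitespace around the phrase only the plain literal matches,
# because the other two require \s before INSTANT and after ACCESS.
bare = hits("INSTANT ACCESS")
```

The plain literal is the only variant that does not depend on what surrounds the phrase, which is also why it gives the regex engine the least to backtrack over.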

> > LINE_OF_YELLING:
> > /^[A-Z0-9\$\.,\'\!\?\s]{20,}[A-Z\$\.,\'\!\?]{5,}[A-Z0-9\$\.,\'\!\?\s]{20,}$/
> 
> it only slows down mails with lots of uppercase chars, so it isn't a problem.
> (I got only 8 slow checks in total, but together they took 1 min 20 secs!)
> anyway, optimizing it in C using a state machine could help.
> the main point of this rule: there must be at least 20 uppercase chars
> without lowercase between them, and at least one uppercase word longer than
> 5 chars. easy to implement in C.

I've always thought this rule was a little too complicated, and might be
better re-written as an eval method.  I'll put a note in bugzilla to do
this.
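
A rough sketch of what such a single-pass check might look like, in Python rather than as a real SpamAssassin eval method. The function name line_of_yelling and the sets YELL and FILL are hypothetical, and whitespace is assumed to be ASCII only. It follows the regexp literally: every character must come from the allowed set, and some run of 5 "loud" characters must have at least 20 characters before it and 20 after it.

```python
# Characters allowed in the middle {5,} group of the regexp.
YELL = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ$.,'!?")
# Characters allowed in the outer {20,} groups: the above plus digits
# and (ASCII) whitespace.
FILL = YELL | set("0123456789 \t\n\r\f\v")


def line_of_yelling(line):
    """Linear-time check meant to mirror the LINE_OF_YELLING regexp."""
    # The regexp is anchored at both ends, so one stray character
    # (for example a lowercase letter) rules the line out.
    if any(ch not in FILL for ch in line):
        return False
    run = 0  # length of the current run of YELL characters
    for i, ch in enumerate(line):
        run = run + 1 if ch in YELL else 0
        if run >= 5:
            start = i - 4  # first index of the 5-character window
            if start >= 20 and len(line) - (i + 1) >= 20:
                return True
    return False
```

Each character is looked at a constant number of times, so a long all-caps line costs no more than any other line of the same length.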

> ASCII_FORM_ENTRY:
> not an easy rule.
> 
> rawbody ASCII_FORM_ENTRY        /[^<][A-Za-z][A-Za-z]+.{1,15}?\s+_{30,}/
> 
> could someone please explain what [^<] matches?
> afaik ^ means beginning-of-line, but it looks strange inside a [] character
> class. so, what does ^ mean there? beginning-of-line or the '^' character?
> i think it's beginning-of-line, as PCRE couldn't optimize this regexp with
> a possible-first-chars table. then we should split this into 2 rules. it is
> really slow on too many mails.
> (i've got 11687 slow (took longer than 1ms) checks running on your spam coll.)

Does anyone know why it cares about that [^<]?  Seems to me like the
rest of the rule is descriptive enough that it should match.  Actually,
I'd say that just /\s+_{30,}/ would probably be a decent rule for
this.
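
On the [^<] question: a ^ as the first character inside [] negates the class, so [^<] means "any one character except <"; it is neither an anchor nor a literal '^' there. The sketch below checks this with Python's re module (standing in for Perl/PCRE, which as far as I know agree on this) and compares the full rule with the shorter one on two made-up sample lines. The names NEGATED, ANCHOR, RULE, SHORT, form_line and html_line are hypothetical.

```python
import re

NEGATED = re.compile(r"[^<]")  # any single character other than "<"
ANCHOR = re.compile(r"^<")     # a "<" at the very start of the text

RULE = re.compile(r"[^<][A-Za-z][A-Za-z]+.{1,15}?\s+_{30,}")
SHORT = re.compile(r"\s+_{30,}")

# A plain-text form field: a label followed by a long run of underscores.
form_line = " Name: " + "_" * 30
# Underscores that follow a short HTML tag instead of a label.
html_line = "<hr> " + "_" * 30

checks = {
    "negated_accepts_letter": bool(NEGATED.fullmatch("a")),
    "negated_accepts_caret": bool(NEGATED.fullmatch("^")),
    "negated_rejects_lt": bool(NEGATED.fullmatch("<")),
    "anchor_at_start": bool(ANCHOR.search("<b>")),
    "anchor_not_mid_string": bool(ANCHOR.search("a<b>")),
    "rule_on_form": bool(RULE.search(form_line)),
    "short_on_form": bool(SHORT.search(form_line)),
    "rule_on_html": bool(RULE.search(html_line)),
    "short_on_html": bool(SHORT.search(html_line)),
}
```

Both rules hit the form line, but only the short one hits the HTML line, so dropping the leading part would presumably widen what the rule matches.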

> the remaining 2 slow rules are:
> PORN_3
> MSG_ID_ADDED_BY_MTA_2
> 
> PORN_3 begins with double (?: | ),

I'm planning on breaking PORN_3 into 2 rules -- it's not particularly
well constructed now.  I'm going to break out really-dirty-words and
just slightly-dirty-words into 2 separate rules.  The other reason
PORN_3 is probably slow is the {3,} at the end.

> MSG_ID_ADDED_BY_MTA_2 partially matches every header block (that has a
> Message-Id: field), causing the regexp search to be slow.

I can't think of any way to speed this up, even by using an eval instead
of regex -- you basically are trying to find headers where the
Message-id field is immediately followed by a Received header, but the
Message-id field wasn't from yahoo.
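
For illustration only, the condition described above could be written as a line scan along these lines. This is a sketch under several assumptions, not the real rule (which is not quoted in this thread): the function name msgid_added_by_mta is hypothetical, headers arrive as one string with one header per line and no folded continuation lines, and "from yahoo" is approximated by a substring test.

```python
def msgid_added_by_mta(headers):
    """True if Message-Id is directly followed by Received and is not from yahoo."""
    lines = headers.splitlines()
    for i, line in enumerate(lines):
        lowered = line.lower()
        if not lowered.startswith("message-id:"):
            continue
        # Assumption: any Message-Id mentioning "yahoo" is exempt.
        if "yahoo" in lowered:
            return False
        # Only the header immediately after Message-Id is of interest.
        if i + 1 < len(lines):
            return lines[i + 1].lower().startswith("received:")
        return False
    return False
```

A scan like this still has to look at every header line once, so it is not obviously faster than the regexp.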


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
