Hi,

> I ran my C version through your really big spam collection overnight, and
> filtered out the 'slow' messages. Then I checked which regexps make them so
> slow (slow means 5..25 secs/mail on a P4 1.8GHz).

more on this...

> FOR_INSTANT_ACCESS:
> /(?:CLICK HERE|).{0,20}\s+INSTANT\s+ACCESS.{0,20}\s+(?:|CLICK HERE)/i
> 
> I think its author wanted to match "CLICK HERE * INSTANT ACCESS" and
> "INSTANT ACCESS * CLICK HERE" but in a single common regexp.

anyway, it's bad...
it also matches a bare " INSTANT ACCESS " with no "CLICK HERE" before or
after it, because both alternations have an empty branch. so those parts of
the regexp are useless and just slow things down a lot. faster alternative:

body FOR_INSTANT_ACCESS         /\sINSTANT\s+ACCESS.{0,20}\s+/i

correct me if i'm wrong, i'm still a newbie in the regexp world :)
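
a quick way to check the claim is to run both patterns over a few sample
lines and compare the results. this is only a sketch: it uses Python's re
module as a stand-in for Perl/PCRE, and the sample strings are made up.

```python
import re

# The rule as quoted above, and the proposed shorter replacement.
ORIGINAL = re.compile(
    r"(?:CLICK HERE|).{0,20}\s+INSTANT\s+ACCESS.{0,20}\s+(?:|CLICK HERE)", re.I
)
SIMPLIFIED = re.compile(r"\sINSTANT\s+ACCESS.{0,20}\s+", re.I)

# Made-up sample lines (assumption: not taken from any real corpus).
SAMPLES = [
    "CLICK HERE for INSTANT ACCESS now CLICK HERE",
    "get INSTANT ACCESS today ",        # no CLICK HERE anywhere
    "INSTANT ACCESS at line start ",    # no whitespace before INSTANT
    "no trigger words here",
]


def compare(samples):
    """Return (original_hit, simplified_hit) for every sample line."""
    return [
        (bool(ORIGINAL.search(s)), bool(SIMPLIFIED.search(s))) for s in samples
    ]


for line, hits in zip(SAMPLES, compare(SAMPLES)):
    print(hits, repr(line))
```

if the two columns agree on a large corpus as well, swapping the rule should
not change which mails hit it.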

> LINE_OF_YELLING:
> /^[A-Z0-9\$\.,\'\!\?\s]{20,}[A-Z\$\.,\'\!\?]{5,}[A-Z0-9\$\.,\'\!\?\s]{20,}$/

it only slows down mails with lots of uppercase chars, so it isn't a big
problem. (i got 8 slow checks in total, but together they took 1min20 secs!)
anyway, optimizing it in C using a state machine could help.
the main point of this rule: the whole line must consist of uppercase
letters, digits, punctuation and whitespace (no lowercase), with at least 20
such chars on each side of a run of at least 5 uppercase/punctuation chars
(no digits or whitespace in that run). easy to implement in C.
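
the single-pass idea could look roughly like this. it is a sketch in Python
rather than C, the function name is made up, and \s is assumed to mean ASCII
whitespace only. the regexp is kept next to it purely as a cross-check.

```python
import re
import string

# Middle class of the rule: [A-Z$.,'!?]
SHOUT = set(string.ascii_uppercase) | set("$.,'!?")
# Outer class of the rule: [A-Z0-9$.,'!?\s] (ASCII whitespace assumed)
ALLOWED = SHOUT | set(string.digits) | set(string.whitespace)

# The original rule, used only to cross-check the hand-written scan.
REFERENCE = re.compile(
    r"^[A-Z0-9\$\.,\'\!\?\s]{20,}[A-Z\$\.,\'\!\?]{5,}[A-Z0-9\$\.,\'\!\?\s]{20,}$",
    re.ASCII,
)


def is_line_of_yelling(line):
    """One pass over the line, no backtracking (hypothetical helper)."""
    n = len(line)
    if n < 45:                      # 20 + 5 + 20 chars is the minimum
        return False
    run = 0                         # length of the current SHOUT run
    found = False
    for pos, ch in enumerate(line):
        if ch not in ALLOWED:       # e.g. any lowercase char: give up at once
            return False
        run = run + 1 if ch in SHOUT else 0
        # The last 5 chars of the run span pos-4..pos; they need at least
        # 20 chars before them and at least 20 chars after them.
        if run >= 5 and pos - 4 >= 20 and pos + 1 <= n - 20:
            found = True
    return found
```

the early return on the first disallowed char is where most of the gain
would come from: a normal mixed-case line is rejected after a few chars.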

ASCII_FORM_ENTRY:
not an easy rule.

rawbody ASCII_FORM_ENTRY        /[^<][A-Za-z][A-Za-z]+.{1,15}?\s+_{30,}/

could someone please explain what [^<] matches?
afaik ^ means beginning-of-line, but it looks strange inside a [] character
class. so, what does ^ mean there? beginning-of-line or a literal '^' char?
i think it's beginning-of-line, as PCRE couldn't optimize this regexp with
its possible-first-chars table. if so, we should split this into 2 rules.
it is really slow on too many mails.
(i got 11687 slow checks (each took longer than 1ms) running on your spam
collection.)
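
for reference, a quick experiment suggests the answer is neither: a ^ placed
right after the [ negates the class, so [^<] matches any single char except
'<'. the check below uses Python's re module as a stand-in for Perl/PCRE
(which treat a leading ^ in a class the same way); the helper name and test
strings are made up.

```python
import re


def matches_one_char(pattern, ch):
    """True if the pattern matches exactly the single character ch."""
    return re.fullmatch(pattern, ch) is not None


# [^<] accepts any single character except '<', including '^' and newline.
negated = {ch: matches_one_char(r"[^<]", ch) for ch in ["a", "^", "\n", "<"]}
print(negated)

# It is not an anchor: it consumes one char and can match mid-line.
mid = re.search(r"[^<]ab", "<xab")
print(mid.start())

# Because it must consume a char, it cannot match at the very start.
at_start = re.search(r"[^<]ab", "ab")
print(at_start)
```

that would also fit the missing first-chars optimization: nearly every char
is a possible first char of this rule, so there is little to precompute.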

the remaining 2 slow rules are:
PORN_3
MSG_ID_ADDED_BY_MTA_2

PORN_3 begins with a double (?: | ),
MSG_ID_ADDED_BY_MTA_2 partially matches the headers of every mail that has
a Message-Id: field, which makes the regexp search slow.

i have no idea how to speed up these.

other regexps are rare or fast enough.

Note: by changing only FOR_INSTANT_ACCESS as described above, i got a
45mins->29mins speedup (~35% less run time). so it IS worth optimizing and
verifying regexps!
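
the kind of per-rule timing used to find such rules could be sketched like
this. it is not the C tool mentioned above, just a hypothetical Python
stand-in: the function name, the rules dict and the 1ms default threshold
are assumptions for illustration.

```python
import re
import time


def find_slow_checks(rules, messages, threshold_s=0.001):
    """Time every rule on every message and report the slow pairs.

    rules:    dict mapping a rule name to a compiled regexp
    messages: list of message bodies (plain strings)
    Returns a list of (rule_name, message_index, seconds) tuples.
    """
    slow = []
    for name, compiled in rules.items():
        for index, text in enumerate(messages):
            start = time.perf_counter()
            compiled.search(text)
            elapsed = time.perf_counter() - start
            if elapsed > threshold_s:
                slow.append((name, index, elapsed))
    return slow


# Usage with the simplified rule from above and two made-up messages.
rules = {
    "FOR_INSTANT_ACCESS": re.compile(r"\sINSTANT\s+ACCESS.{0,20}\s+", re.I),
}
messages = ["get INSTANT ACCESS today ", "nothing to see here"]
for name, index, seconds in find_slow_checks(rules, messages):
    print(name, index, seconds)
```

counting the reported pairs per rule name gives the "N slow checks" numbers,
and summing the seconds shows which rule is worth rewriting first.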


A'rpi / Astral & ESP-team

--
Developer of MPlayer, the Movie Player for Linux - http://www.MPlayerHQ.hu
