Hi,

I ran my C version against your really big spam collection overnight and
filtered out the 'slow' messages. Then I checked which regexps make them so
slow ('slow' means 5..25 secs/mail on a P4 1.8GHz).

Most 'slow' mails contain many (>1000) repeats of a single char
(XXXXXXXXXXXXXXXXXXXXXXXXX...XXXXXXXXXXXXXXXXXX) or of a tab+newline pair.

The XXXXXXXXX one triggers:
slow rule ASCII_FORM_ENTRY: 0.372282s
slow rule LINE_OF_YELLING: 10.299467s
The repeated tab+newline one triggers:
slow rule FOR_INSTANT_ACCESS: 7.514345s
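
Per-rule timings like the ones above can be collected with a small harness.
This is only a sketch in Python, not the C version that produced these
numbers; the function name report_slow_rules, the threshold and the demo rule
are made up for illustration.

```python
import re
import time


def report_slow_rules(rules, text, threshold=0.1):
    """Return (name, seconds) for every rule whose search took longer than threshold."""
    slow = []
    for name, pattern in rules.items():
        start = time.perf_counter()
        pattern.search(text)  # only the time matters here, not the result
        elapsed = time.perf_counter() - start
        if elapsed > threshold:
            slow.append((name, elapsed))
    return slow


if __name__ == "__main__":
    # Hypothetical demo rule, not a real SpamAssassin rule definition.
    rules = {"DEMO_RULE": re.compile(r"INSTANT\s+ACCESS", re.I)}
    # threshold=0.0 reports every rule, just to show the output format.
    for name, secs in report_slow_rules(rules, "X" * 2000, threshold=0.0):
        print(f"slow rule {name}: {secs:.6f}s")
```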

Let's look at them.

FOR_INSTANT_ACCESS:
/(?:CLICK HERE|).{0,20}\s+INSTANT\s+ACCESS.{0,20}\s+(?:|CLICK HERE)/i

I think its author wanted to match "CLICK HERE * INSTANT ACCESS" and
"INSTANT ACCESS * CLICK HERE" with a single common regexp.
Because it starts with an alternation that has an empty branch, (...|), the
regexp matcher has no fixed first char to scan for, so the search is slow.
I think splitting this rule into 2 rules would speed up this check a
LOT. Note that this regexp is always much slower than the other regexps;
this mail just makes it slow as hell.
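
A rough sketch of that split, in Python rather than as SpamAssassin rule
lines. The two pattern names and the helper for_instant_access are made up
here. Note one assumption: because of the empty branches, the original regexp
also matches "INSTANT ACCESS" with no "CLICK HERE" at all, so the split version
is stricter and only matches what the author presumably intended.

```python
import re

# Each pattern now starts with a fixed literal, so the matcher can skip
# straight to candidate positions instead of trying every offset.
CLICK_THEN_ACCESS = re.compile(r"CLICK HERE.{0,20}\s+INSTANT\s+ACCESS", re.I)
ACCESS_THEN_CLICK = re.compile(r"INSTANT\s+ACCESS.{0,20}\s+CLICK HERE", re.I)


def for_instant_access(text):
    """True if either of the two split rules matches (stricter than the original)."""
    return bool(CLICK_THEN_ACCESS.search(text) or ACCESS_THEN_CLICK.search(text))


if __name__ == "__main__":
    print(for_instant_access("CLICK HERE FOR INSTANT ACCESS"))
    print(for_instant_access("Get instant access now, click here"))
    # The tab+newline input described above contains neither literal.
    print(for_instant_access("\t\n" * 2000))
```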

LINE_OF_YELLING:
/^[A-Z0-9\$\.,\'\!\?\s]{20,}[A-Z\$\.,\'\!\?]{5,}[A-Z0-9\$\.,\'\!\?\s]{20,}$/
Trivial: it doesn't have a single fixed first char, so the search is slow.
Either rewriting this check in C or using the 'study' feature of PCRE could
help. I'll try.
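
As a sketch of what a hand-written check could look like (in Python here, not
C, and not what was actually tried): since the middle character set is a
subset of the outer one, the regexp matches exactly when every char is in the
outer set and a run of 5 middle-set chars sits at least 20 chars from both
ends. That can be tested in one pass. The names below are made up, and it is
assumed the rule is applied to one line as a whole string.

```python
import re

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
PUNCT = "$.,'!?"
WIDE = set(UPPER + "0123456789" + PUNCT + " \t\n\r\f\v")  # outer character set
NARROW = set(UPPER + PUNCT)                               # middle character set

# The original rule as a full-string match, kept only for cross-checking.
REFERENCE = re.compile(
    r"[A-Z0-9$.,'!?\s]{20,}[A-Z$.,'!?]{5,}[A-Z0-9$.,'!?\s]{20,}", re.ASCII
)


def line_of_yelling(line):
    """Linear-time equivalent of the LINE_OF_YELLING regexp for one line."""
    n = len(line)
    if n < 45:                         # 20 + 5 + 20 chars at minimum
        return False
    if any(ch not in WIDE for ch in line):
        return False
    run = 0
    for i in range(20, n - 20):        # keep 20 chars free on both sides
        if line[i] in NARROW:
            run += 1
            if run >= 5:
                return True
        else:
            run = 0
    return False


if __name__ == "__main__":
    print(line_of_yelling("X" * 5000))        # long run of a single char
    print(line_of_yelling("X" * 5000 + "x"))  # same, but with a lowercase tail
```

Both calls return immediately, because no backtracking is involved.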

rawbody ASCII_FORM_ENTRY        /[^<][A-Za-z][A-Za-z]+.{1,15}?\s+_{30,}/
Same as above.
There are a few rules that start with a character set instead of a single fixed
char, which makes regexp matching much slower. Maybe rethinking these or
splitting them into several rules could help.
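
One possible way to "rethink" such a rule, offered only as an assumption and
not something tested against the corpus: ASCII_FORM_ENTRY can never match
without 30 underscores in a row, so a cheap substring test can skip the regexp
for most mails. A Python sketch, with made-up names:

```python
import re

ASCII_FORM_ENTRY = re.compile(r"[^<][A-Za-z][A-Za-z]+.{1,15}?\s+_{30,}")
REQUIRED = "_" * 30  # literal that every match of the rule must contain


def ascii_form_entry(text):
    """Run the regexp only when the required literal is present."""
    if REQUIRED not in text:
        return False  # fast path: the rule cannot match
    return ASCII_FORM_ENTRY.search(text) is not None


if __name__ == "__main__":
    print(ascii_form_entry("Name: " + "_" * 30))
    print(ascii_form_entry("X" * 5000))  # no underscores, regexp is never run
```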


A'rpi / Astral & ESP-team

--
Developer of MPlayer, the Movie Player for Linux - http://www.MPlayerHQ.hu

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
