On Thu, 2002-02-21 at 10:22, Arpi wrote:
> Hi,
>
> > I've ran my C version through your really big spam collection at night, and
> > filtered out 'slow' messages. Then I've checked which regexps makes them so
> > slow (slow mean 5..25 secs/mail on p4 1.8ghz).
>
> more on this...
>
> > FOR_INSTANT_ACCESS:
> > /(?:CLICK HERE|).{0,20}\s+INSTANT\s+ACCESS.{0,20}\s+(?:|CLICK HERE)/i
> >
> > I think its author wanted to match "CLICK HERE * INSTANT ACCESS" and
> > "INSTANT ACCESS * CLICK HERE" but in a singel common regexp.
>
> anyway it's bad...
> it matches single " INSTANT ACCESS " without having "CLICK HERE" before or
> after it... so that part of regexp is useless and just slows things down a
> lot. faster alternative:
>
> body FOR_INSTANT_ACCESS /\sINSTANT\s+ACCESS.{0,20}\s+/i
>
> correct me if i'm wrong, i'm still newbie in regexp world :)
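A quick way to check the claim quoted above is to run both patterns over a few strings. This is only a sketch: it uses Python's re module instead of Perl or PCRE (an assumption made for illustration; the constructs involved behave the same way there), and the sample strings are made up.

```python
import re

# Original rule: both "CLICK HERE" groups contain an empty alternative,
# so each of them is allowed to match nothing at all.
ORIGINAL = re.compile(
    r"(?:CLICK HERE|).{0,20}\s+INSTANT\s+ACCESS.{0,20}\s+(?:|CLICK HERE)",
    re.IGNORECASE,
)

# The faster alternative proposed in the quoted message.
PROPOSED = re.compile(r"\sINSTANT\s+ACCESS.{0,20}\s+", re.IGNORECASE)

# Made-up sample strings, not taken from any real corpus.
samples = [
    "Get INSTANT ACCESS now",               # no "CLICK HERE" anywhere
    "CLICK HERE for INSTANT ACCESS today",  # the case the rule was aimed at
    "INSTANT ACCESS",                       # no whitespace on either side
]


def hits(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


# One (original, proposed) pair per sample string.
results = [(hits(ORIGINAL, s), hits(PROPOSED, s)) for s in samples]

for sample, pair in zip(samples, results):
    print(pair, repr(sample))
```

On these samples the two patterns agree, and the first sample matches the original rule even though it contains no "CLICK HERE" at all.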
I think body FOR_INSTANT_ACCESS /INSTANT ACCESS/i is fine by itself. I'll
make that change now.

> > LINE_OF_YELLING:
> > /^[A-Z0-9\$\.,\'\!\?\s]{20,}[A-Z\$\.,\'\!\?]{5,}[A-Z0-9\$\.,\'\!\?\s]{20,}$/
>
> it only slows down mails with lots of uppercase chars, so it isn't problem.
> (got total 8 slow checks, but they took total 1min20 secs!)
> anyway optimizing it in C using a state-machine could help.
> the main point of this rule: there must be at least 20 uppercase chars
> without lowercase between them, and at least one uppercase word longer than
> 5 chars. easy to implement in C.

I've always thought this rule was a little too complicated, and might be
better re-written as an eval method. I'll put a note in bugzilla to do this.

> ASCII_FORM_ENTRY:
> not an easy rule.
>
> rawbody ASCII_FORM_ENTRY /[^<][A-Za-z][A-Za-z]+.{1,15}?\s+_{30,}/
>
> could someone please explain what does [^<] matches ?
> afaik ^ means beginning-of-line but it's strange in [] character array.
> so, what does ^ mean there? begin-of-line or '^' character?
> i think it's beg-of-line, as PCRE couldn't optimize this regexp with
> possible-first-chars-table. then we should split this to 2 rules. it is
> really slow at too many mails.
> (i've got 11687 slow (took longer than 1ms) checks running on your spam coll.)

For what it's worth, inside brackets a leading ^ negates the class, so [^<]
matches any single character other than '<'; it is not a beginning-of-line
anchor there. Does anyone know why the rule cares about that [^<]? Seems to
me like the rest of the rule is descriptive enough that it should match.
Actually, I'd say that just /\s+_{30,}/ would probably be a decent rule for
this.

> the remaining 2 slow rules are:
> PORN_3
> MSG_ID_ADDED_BY_MTA_2
>
> PORN_3 begins with double (?: | ),

I'm planning on breaking PORN_3 into 2 rules -- it's not particularly well
constructed now. I'm going to break out really-dirty-words and just
slightly-dirty-words into 2 separate rules. The other reason PORN_3 is
probably slow is the {3,} at the end.

> MSG_ID_ADDED_BY_MTA_2 partially matches every headers (which has Message-Id:
> field) causing regexp search to be slow.
I can't think of any way to speed this up, even by using an eval instead of
regex -- you basically are trying to find headers where the Message-id field
is immediately followed by a Received header, but the Message-id field
wasn't from yahoo.
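P.S. Coming back to LINE_OF_YELLING from earlier in this message: here is a rough sketch of what a single-pass check could look like. It is written in Python purely for illustration (a real eval method would live in SpamAssassin's Perl code), the function name line_of_yelling is hypothetical, and it follows the regex as quoted rather than the looser prose description. The regex is kept in the sketch only so the two can be compared on a few made-up lines.

```python
import re
import string

# The rule as quoted, compiled with re.ASCII so \s means ASCII whitespace only.
RULE = re.compile(
    r"^[A-Z0-9\$\.,\'\!\?\s]{20,}[A-Z\$\.,\'\!\?]{5,}[A-Z0-9\$\.,\'\!\?\s]{20,}$",
    re.ASCII,
)

# Characters allowed anywhere on the line (first and last classes of the rule).
WIDE = set(string.ascii_uppercase + string.digits + "$.,'!?" + string.whitespace)
# Characters allowed in the middle run (no digits, no whitespace).
NARROW = set(string.ascii_uppercase + "$.,'!?")


def line_of_yelling(line: str) -> bool:
    """Hypothetical single-pass version of the check, with no backtracking."""
    n = len(line)
    run = 0        # length of the current run of NARROW characters
    found = False  # saw 5 NARROW chars with at least 20 chars on each side
    for k, ch in enumerate(line):
        if ch not in WIDE:
            return False  # any other character rules the whole line out
        run = run + 1 if ch in NARROW else 0
        start = k - 4     # start index of the 5-character window ending at k
        if run >= 5 and start >= 20 and n - (k + 1) >= 20:
            found = True
    return found


# Made-up sample lines for comparing the sketch with the regex.
samples = [
    "THIS IS A VERY LOUD LINE OF SHOUTING THAT GOES ON AND ON",
    "Hello this is a normal lowercase sentence that is long enough here",
    "SHORT LINE",
    "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z A B C",
    "1234567890 1234567890 12345 1234567890 1234567890",
]

# (line, sketch result, regex result) for each sample.
checks = [(s, line_of_yelling(s), RULE.search(s) is not None) for s in samples]

for s, mine, regex in checks:
    print(mine, regex, repr(s))
```

The loop visits each character once, so lines that are almost entirely uppercase cost no more than any other line of the same length. Whether that is actually faster in practice than the regex would need measuring.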