-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Menschel writes:
>What I have been able to measure is the time needed for a mass check.
>When I run mass-check against my now 50k corpus (that's 50k email
>messages), it takes 15-16 minutes to run for a single rule. Adding a
>small number of rules doesn't seem to have much impact. However, when I
>ran your full set of 4800 rules in one pass, mass check took 1.5 hours.
>
>We can figure this two ways:
>* 4800 rules takes 75 minutes longer than 1 rule, therefore it takes
>0.0156 minutes = 0.938 seconds per rule
>* 4800 rules x 50k messages takes 90 minutes. Therefore 4800 rules x 1
>message should take 0.11 seconds. The experience of those who attempted
>to apply Chris' full EvilRules set indicates this is not a valid analysis
>(1700 rules is too much to add to busy email servers).

That's *exactly* the methodology. For more info, install Devel::DProf
and use

    rm tmon.out
    perl -d:DProf mass-check -j1 ...
    dprofpp -O 999 > dprof.out

(-j1 so it doesn't fork). Then you can figure out from dprof.out which
rules in particular are slow -- the rules are compiled into individual
perl subroutines and the output is sorted by runtime. e.g.:

    0.26 0.020 0.020 82 0.0002 0.0002 Mail::SpamAssassin::PerMsgStatus::__NIGERIAN_BODY_8_body_test

shows the time taken to run __NIGERIAN_BODY_8 across that corpus. This
applies to most rules. The exceptions are header rules, which are all
compiled into 1 sub, and eval rules, which keep their own subs, BTW.

(You may have to rerun the command if it dumps core -- there's a weird
startup bug with Devel::DProf, but it doesn't cause anything worth
worrying about.)

>>> Are there ways to improve the performance of the checks? I ask
>>> because these URI rules are tripping on about 50-60% of my current
>>> spam - much more than the corresponding source domain blacklist rules.

Quick speed tips:

    .*                         = slow
    lookaheads or lookbehinds  = very slow
    anchoring with \b          = fast
    anchoring with ^, $        = faster

>Performance improvements? Maybe.
>And I don't know whether any of this
>will help -- it'll take experimentation unless the developers have some
>answers here.
>
>Possibility 1: combine rules. If you can combine 10 tests into a single
>rule,
>> uri rulename /(?:spammer1|spammer2|s3|s4|s5|s6|s7|s8|s9|s10)\.com/i
>then you'll have only 480 rules, not 4800. I don't know if this will have
>any impact, but maybe...

That *will* help -- but at the expense of being able to catch FPing
(false-positive) rules and fix them easily. You sacrifice a *lot* of
readability that way.

>Possibility 2: bound the rules. I noted that the URI for 16.com matched
>significant ham. Test for /\bdomain/ and maybe it'll run a trifle
>faster.

Yes. If you can bound at the start of the URL it'll probably be faster
still...

- --j.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Exmh CVS

iD8DBQE/uZ2kQTcbUG5Y7woRAiXLAKCqeHE2Ahu3WuCvsr+90vbicJzxJwCcCAQi
bebTcXRV3LRt9/h1Hu4lZQA=
=LFa7
-----END PGP SIGNATURE-----

_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
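
To make Possibilities 1 and 2 from the message above concrete, here is a
rough sketch (in Python, not SpamAssassin's own Perl) that builds one
combined, \b-bounded uri rule line from a list of domain labels and then
tries the pattern on two URLs. The helper combined_uri_rule, the rule name
EXAMPLE_COMBINED and the sample URLs are all made up for illustration.
Python's re module is assumed to treat \b, (?:...) and alternation the same
way Perl does for a pattern this simple. The second URL is only a guess at
how an unbounded 16.com pattern could hit ham; the thread does not say
which URLs actually matched.

```python
import re


def combined_uri_rule(name, labels, tld="com"):
    """Return (rule_line, pattern) for one rule covering several domains.

    Hypothetical helper, not part of SpamAssassin.
    """
    # Escape each label so characters like '-' are matched literally.
    alternation = "|".join(re.escape(label) for label in labels)
    # \b bounds the match at the start of the domain label (Possibility 2).
    pattern = r"\b(?:%s)\.%s" % (alternation, re.escape(tld))
    # Same shape as the rule line quoted in the thread.
    return "uri %s /%s/i" % (name, pattern), pattern


rule_line, pattern = combined_uri_rule(
    "EXAMPLE_COMBINED", ["spammer1", "spammer2", "16"]
)
print(rule_line)

regex = re.compile(pattern, re.IGNORECASE)

# A listed domain still matches after combining.
print(bool(regex.search("http://www.spammer2.com/buy")))   # True

# With \b, "16.com" no longer matches inside a longer label.
print(bool(regex.search("http://www.example216.com/")))    # False

# Without the bound, the same URL would have matched.
print(bool(re.search(r"16\.com", "http://www.example216.com/")))  # True
```

Note the cost mentioned in the reply: once the domains share one rule, a
hit only names EXAMPLE_COMBINED, so finding which alternative caused a
false positive means testing the alternatives individually again.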