Re: Re[2]: [SAtalk] Sanity checking new uri rules?

Justin Mason Mon, 17 Nov 2003 21:55:39 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Robert Menschel writes:
>What I have been able to measure is the time needed for a mass check.
>When I run mass-check against my now 50k corpus (that's 50k email
>messages), it takes 15-16 minutes to run for a single rule. Adding a
>small number of rules doesn't seem to have much impact. However, when I
>ran your full set of 4800 rules in one pass, mass check took 1.5 hours.
>
>We can figure this two ways:
>* 4800 rules takes 75 minutes longer than 1 rule, therefore it takes
>0.0156 minutes = 0.938 seconds per rule
>* 4800 rules x 50k messages takes 90 minutes. Therefore 4800 rules x 1
>message should take 0.11 seconds. The experience of those who attempted
>to apply Chris' full EvilRules set indicates this is not a valid analysis
>(1700 rules is too much to add to busy email servers).
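
Both estimates quoted above can be reproduced with a few lines of arithmetic. A minimal Python sketch, using only the figures from the quoted message (the variable names are illustrative, and 15 minutes is taken as the single-rule baseline because 90 - 15 gives the quoted 75):

```python
# Figures from the quoted message; variable names are illustrative.
RULES = 4800
MESSAGES = 50_000
BASELINE_MIN = 15      # one rule over the whole corpus (quoted as 15-16 minutes)
FULL_RUN_MIN = 90      # all 4800 rules over the whole corpus (1.5 hours)

# Estimate 1: marginal cost of one extra rule, run across the whole corpus.
extra_min = FULL_RUN_MIN - BASELINE_MIN
sec_per_rule = extra_min * 60 / RULES

# Estimate 2: cost of running the full rule set against a single message.
sec_per_message = FULL_RUN_MIN * 60 / MESSAGES

print(f"{sec_per_rule:.3f} s per rule across the corpus")
print(f"{sec_per_message:.3f} s per message for all rules")
```

Note that the first figure is a per-rule cost over all 50k messages, while the second is a per-message cost over all 4800 rules, so the two numbers are not directly comparable.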

That's *exactly* the methodology.  For more info, install Devel::DProf
and use

        rm tmon.out
        perl -d:DProf mass-check -j1 ...
        dprofpp -O 999 > dprof.out

(-j1 so it doesn't fork).  Then you can figure out from dprof.out which
rules in particular are slow -- the rules are compiled into individual
perl subroutines and the output is sorted by runtime. e.g.:


 0.26   0.020  0.020     82   0.0002 0.0002 Mail::SpamAssassin::PerMsgStatus::
                                            __NIGERIAN_BODY_8_body_test

shows the time taken to run __NIGERIAN_BODY_8 across that corpus (here,
0.020 seconds over 82 calls).  This applies to most rules; the exceptions
are header rules, which are all compiled into one sub, and eval rules,
which keep their own subs.

(You may have to rerun the command if it dumps core -- there's a weird
startup bug with Devel::DProf, but it doesn't cause anything worth
worrying about.)
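
To pull the per-rule rows out of dprof.out, something like the following Python sketch may help. It assumes every row has the same seven-column layout as the sample line above (six numeric columns, then the sub name) and that rule subs end in "_test" as the sample does; both are assumptions drawn from that one line, not from the Devel::DProf documentation. The function name is hypothetical.

```python
import re

# One profiler row: six numeric columns, then the sub name.
# The column layout is assumed from the sample line in this message.
ROW = re.compile(
    r"^\s*(?P<pct>[\d.]+)\s+(?P<excl>-?[\d.]+)\s+(?P<cumul>-?[\d.]+)\s+"
    r"(?P<calls>\d+)\s+(?P<per_call>[\d.]+)\s+(?P<cper_call>[\d.]+)\s+"
    r"(?P<name>\S+)\s*$"
)
PREFIX = "Mail::SpamAssassin::PerMsgStatus::"

SAMPLE = (
    " 0.26   0.020  0.020     82   0.0002 0.0002 Mail::SpamAssassin::PerMsgStatus::\n"
    "                                            __NIGERIAN_BODY_8_body_test\n"
)

def slowest_rules(text, top=10):
    """Return (exclusive_seconds, calls, sub_name) tuples, slowest first."""
    rows = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        m = ROW.match(line)
        if not m:
            continue
        name = m.group("name")
        # Rejoin a sub name that was wrapped onto the next line, as in SAMPLE.
        if name.endswith("::") and i + 1 < len(lines):
            name += lines[i + 1].strip()
        if name.startswith(PREFIX) and name.endswith("_test"):
            rows.append(
                (float(m.group("excl")), int(m.group("calls")), name[len(PREFIX):])
            )
    return sorted(rows, reverse=True)[:top]

if __name__ == "__main__":
    for excl, calls, name in slowest_rules(SAMPLE):
        print(f"{excl:8.3f}s {calls:8d} calls  {name}")
```

To run it on a real profile, read dprof.out into a string and pass that to slowest_rules() in place of SAMPLE.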

>>> Are there ways to improve the performance of the checks?  I ask
>>> because these URI rules are tripping on about 50-60% of my current
>>> spam - much more than the corresponding source domain blacklist rules.

Quick speed tips:

        .* = slow
        lookaheads or lookbehinds = very slow
        anchoring with \b = fast
        anchoring with ^, $ = faster
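
A rough way to check tips like these against your own patterns is to time them directly. The sketch below uses Python's re module, not Perl's regex engine, so the absolute numbers (and possibly the ordering) will differ from what SpamAssassin sees; the URI, the patterns, and the function name are all hypothetical.

```python
import re
import timeit

# Hypothetical non-matching URI and patterns, for illustration only.
URI = "http://www.example.com/" + "a" * 200 + "/offer.html"

PATTERNS = {
    "leading .*": r".*spammer\.example",
    "\\b anchored": r"\bspammer\.example\b",
    "^ anchored": r"^http://spammer\.example",
}

def time_pattern(pattern, text, number=200):
    """Time `number` searches of `pattern` against `text`, in seconds."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return timeit.timeit(lambda: compiled.search(text), number=number)

if __name__ == "__main__":
    for label, pattern in PATTERNS.items():
        print(f"{label:12s} {time_pattern(pattern, URI):.4f}s")
```

A non-matching input is used deliberately: most mail does not hit any given rule, so the cost of failing to match is what dominates a mass-check run.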

>Performance improvements? Maybe. And I don't know whether any of this
>will help -- it'll take experimentation unless the developers have some
>answers here.
>
>Possibility 1: combine rules.  If you can combine 10 tests into a single
>rule,
>> uri rulename /(?:spammer1|spammer2|s3|s4|s5|s6|s7|s8|s9|s10)\.com/i
>then you'll have only 480 rules, not 4800. I don't know if this will have
>any impact, but maybe...

That *will* help -- but at the expense of being able to catch rules that
cause false positives (FPs) and fix them easily.  You sacrifice a *lot* of
readability that way.
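
The trade-off can be seen in a small Python sketch. The domains and rule names are hypothetical, and this only illustrates the regex side; it is not how SpamAssassin itself reports hits. With separate patterns the rule name tells you which domain fired. With a combined pattern you only learn that one of them did, unless you capture the alternative and inspect it offline.

```python
import re

# Hypothetical domains and rule names, for illustration only.
DOMAINS = ["spammer1", "spammer2", "spammer3"]

# One pattern per domain: a hit identifies the offending rule directly.
SEPARATE = {
    f"URI_{d.upper()}": re.compile(rf"\b{re.escape(d)}\.com\b", re.IGNORECASE)
    for d in DOMAINS
}

# One combined pattern: fewer rules to run, but a hit is less specific.
# The capturing group is only there so the matched alternative can be recovered.
COMBINED = re.compile(
    r"\b(" + "|".join(re.escape(d) for d in DOMAINS) + r")\.com\b",
    re.IGNORECASE,
)

def separate_hits(uri):
    """Names of the individual rules that match `uri`."""
    return [name for name, rx in SEPARATE.items() if rx.search(uri)]

def combined_hit(uri):
    """The alternative that matched in the combined rule, or None."""
    m = COMBINED.search(uri)
    return m.group(1).lower() if m else None

if __name__ == "__main__":
    uri = "http://www.spammer2.com/buy"
    print(separate_hits(uri), combined_hit(uri))
```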

>Possibility 2: bound the rules.  I noted that the URI for 16.com matched
>significant ham.  Test for /\bdomain/ and maybe it'll run a trifle
>faster.

Yes.  If you can bound at the start of the URL it'll probably be
faster still...
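
Bounding also cuts down on ham matches, which is the other half of the 16.com problem mentioned above. A small Python sketch, with hypothetical URIs and Python's re standing in for Perl's engine:

```python
import re

# Hypothetical URIs; "16.com" is the domain from the quoted message.
URIS = [
    "http://www.16.com/offer",        # intended target
    "http://16.com/",                 # intended target
    "http://www.example316.com/",     # ham: "16.com" is only a substring
    "http://shop.example/q=16.com",   # ham: domain appears only in the query
]

UNBOUNDED = re.compile(r"16\.com", re.IGNORECASE)
WORD_BOUNDED = re.compile(r"\b16\.com\b", re.IGNORECASE)
START_BOUNDED = re.compile(r"^https?://(?:www\.)?16\.com\b", re.IGNORECASE)

def hits(rx):
    """URIs from the sample list that `rx` matches."""
    return [u for u in URIS if rx.search(u)]

if __name__ == "__main__":
    for label, rx in [("unbounded", UNBOUNDED),
                      ("\\b bounded", WORD_BOUNDED),
                      ("^ bounded", START_BOUNDED)]:
        print(f"{label:12s} {len(hits(rx))} of {len(URIS)} URIs matched")
```

In this sample the unbounded pattern matches all four URIs, \b drops the substring case, and anchoring at the start of the URL leaves only the two intended targets.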

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Exmh CVS

iD8DBQE/uZ2kQTcbUG5Y7woRAiXLAKCqeHE2Ahu3WuCvsr+90vbicJzxJwCcCAQi
bebTcXRV3LRt9/h1Hu4lZQA=
=LFa7
-----END PGP SIGNATURE-----



