Hello Bryan,

Sunday, January 11, 2004, 6:58:58 PM, you wrote:

>> I've just completed documenting my current system at
>> http://www.exit0.us/index.php/BobCorpusTest

BH> So the good news is, I'm now downloading Cygwin!  I've brought down the
BH> default packages which took from 7:07pm to 8:22pm, connected at 32kbps.

BH> Having now installed the basic packages, I'm now working on those
BH> required for an SA installation.

I'm glad I was able to help.

BH> I was wondering about the part:

RM>> The time required for a ruleset of a dozen or so rules takes
RM>> basically the same amount of time as a single rule, so when multiple
RM>> rules are to be tested, they may as well be tested within a single
RM>> file. The time will grow significantly as your corpus grows.
RM>> mass-check runs for the full distribution ruleset (or for hundreds
RM>> of rules) will take significantly longer than mass-check runs for
RM>> small numbers of rules.

BH> Is this making a distinction between number of separate files across
BH> which rules are distributed?  Or is the running time variation here, as
BH> I would hope/think, a function of number of rules, without regard for
BH> the number of files involved?  It's that, "they may as well be tested
BH> within a single file," that's throwing me.

The number of files does not seem to make any difference at all.

The two items that do have impact are the number of emails being scanned
by mass-check (the size of your ham corpus and your spam corpus) and the
number of rules.

The impact of any single additional rule seems to be negligible. When I
run a mass-check against a file which contains one rule, it seems to take
just as long as a file which contains 10 rules or 20 rules.

When I run my full custom rule file through mass-check, with over 1,500
rules, yes, that takes quite a bit longer than a 20-rule file.

But because each individual rule adds so little to the run time, it's
*much* more efficient for me to run one mass-check with 20 rules in it
than to run five passes with four rules each.
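
As a rough illustration of why batching wins, here is a minimal sketch of
a cost model in which every pass pays a fixed price for scanning the
corpus plus a small price per rule. The two-hour pass time is the
approximate figure from this message; the per-rule cost is an invented
placeholder, not something that was measured.

```python
# Illustrative cost model only; none of this is measured mass-check behaviour.
PASS_MINUTES = 120.0      # roughly one pass over the corpus ("just less than 2 hours")
PER_RULE_MINUTES = 0.05   # hypothetical: an invented, deliberately small per-rule cost


def total_minutes(passes: int, rules_per_pass: int) -> float:
    """Total time when each pass pays the corpus cost plus a per-rule cost."""
    return passes * (PASS_MINUTES + rules_per_pass * PER_RULE_MINUTES)


one_batch = total_minutes(passes=1, rules_per_pass=20)
five_batches = total_minutes(passes=5, rules_per_pass=4)

print(f"1 pass   x 20 rules: {one_batch:.0f} minutes")
print(f"5 passes x  4 rules: {five_batches:.0f} minutes")
```

Under these assumptions the same 20 rules cost about five times as much
when split across five passes, because the corpus scan dominates.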

Of course, the amount of ham and spam in your corpus does have an impact.
I have over 17,000 ham and 70,000 spam in my corpus right now, and my run
time for a single rule has gone from 20 minutes when I first started
using Cygwin to just under 2 hours. (Note that I am generally doing other
things on my computer while this is running.)
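
For a sense of scale, the figures above work out to roughly a dozen
messages per second. The sketch below does that arithmetic and then
projects a run time for a larger corpus. The projection assumes run time
grows linearly with corpus size, which is an assumption rather than
something measured here, and the counts and the two-hour figure are
rounded.

```python
# Back-of-the-envelope throughput from the rounded figures in this message.
HAM = 17_000               # "over 17,000 ham"
SPAM = 70_000              # "70,000 spam"
RUN_SECONDS = 2 * 60 * 60  # "just less than 2 hours", rounded up to 2 hours

messages = HAM + SPAM
rate = messages / RUN_SECONDS  # messages scanned per second


def estimated_hours(corpus_size: int) -> float:
    """Projected run time, ASSUMING time scales linearly with corpus size."""
    return corpus_size / rate / 3600


print(f"{messages} messages at about {rate:.1f} messages/second")
print(f"Doubling the corpus: about {estimated_hours(2 * messages):.1f} hours")
```

Because the machine was in use for other things during these runs, the
real scanning rate on an idle system may well be higher.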

Did that answer your question?

Bob Menschel




